Proceedings of the 19th Conference on Computer Science and Intelligence Systems (FedCSIS)

Annals of Computer Science and Information Systems, Volume 39

SrpCNNeL: Serbian Model for Named Entity Linking

DOI: http://dx.doi.org/10.15439/2024F8827

Citation: Proceedings of the 19th Conference on Computer Science and Intelligence Systems (FedCSIS), M. Bolanowski, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 39, pages 465–473

Abstract. This paper presents SrpCNNeL, a Named Entity Linking (NEL) model for the Serbian language that links entities to the Wikidata knowledge base. The model was trained to recognize and link seven named entity types (persons, locations, organisations, professions, events, demonyms, and works of art) on a dataset containing sentences from novels and legal documents, as well as sentences generated from the Wikidata knowledge base and the Leximirka lexical database. The resulting model performed well, achieving an F1 score of 0.8 on the test set. Because locations are the entity type most frequently linked to the knowledge base in the dataset, a further evaluation was conducted on an independent dataset for locations only, and the results were compared with the baseline spaCy Entity Linker.
