
Proceedings of the 19th Conference on Computer Science and Intelligence Systems (FedCSIS)

Annals of Computer Science and Information Systems, Volume 39

Topic Modeling of the SrpELTeC Corpus: A Comparison of NMF, LDA, and BERTopic


DOI: http://dx.doi.org/10.15439/2024F1593

Citation: Proceedings of the 19th Conference on Computer Science and Intelligence Systems (FedCSIS), M. Bolanowski, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 39, pages 649–653


Abstract. Topic modeling is an effective way to gain insight into large amounts of data. Two of the most widely used topic models are Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF). However, with the rise of self-attention models and pre-trained language models, new ways to mine topics have emerged, and BERTopic represents the current state of the art in topic modeling. In this paper, we compared the performance of LDA, NMF, and BERTopic on literary texts in Serbian by measuring topic coherence (TC) and topic diversity (TD), and by evaluating the topics qualitatively. For BERTopic, we compared multilingual sentence-transformer embeddings to the monolingual Jerteh-355 embeddings for Serbian. NMF yielded the best TC, while BERTopic with Jerteh-355 embeddings gave the best TD. Jerteh-355 also outperformed the sentence-transformer embeddings in both TC and TD.
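As a concrete illustration of the comparison the abstract describes, the Python sketch below fits BERTopic with an explicit embedding model and computes the two reported metrics under their usual definitions: TC as gensim's C_v coherence and TD as the fraction of unique words among all topics' top words (following Dieng et al., reference 16). This is a minimal sketch, not the authors' exact pipeline: the loader load_srpeltec_chunks, the whitespace tokenization, and the use of paraphrase-multilingual-MiniLM-L12-v2 as the multilingual sentence-transformer baseline are assumptions for illustration.

    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer
    from gensim.corpora import Dictionary
    from gensim.models.coherencemodel import CoherenceModel

    # Hypothetical loader: returns a list of text chunks from the SrpELTeC corpus.
    docs = load_srpeltec_chunks()
    tokenized = [d.split() for d in docs]  # simplistic tokenization, for illustration only

    # Fit BERTopic with an explicit embedding model; BERTopic accepts any
    # sentence-transformers model via the embedding_model argument.
    embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    topic_model = BERTopic(embedding_model=embedder)
    topics, _ = topic_model.fit_transform(docs)

    # Collect the top words of each topic; topic -1 is BERTopic's outlier bin.
    top_words = [
        [word for word, _ in topic_model.get_topic(t)]
        for t in sorted(set(topics)) if t != -1
    ]

    # Topic coherence (C_v), computed with gensim over the tokenized corpus.
    dictionary = Dictionary(tokenized)
    tc = CoherenceModel(topics=top_words, texts=tokenized,
                        dictionary=dictionary, coherence="c_v").get_coherence()

    # Topic diversity: fraction of unique words among all topics' top words.
    flat = [w for t in top_words for w in t]
    td = len(set(flat)) / len(flat)

    print(f"TC (C_v) = {tc:.3f}, TD = {td:.3f}")

The same two metrics can then be computed for LDA and NMF topics (e.g., from gensim's LdaModel or scikit-learn's NMF) by passing their top-word lists through the identical evaluation code, which keeps the comparison consistent across models.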

References

  1. I. Uglanova and E. Gius, “The order of things. A study on topic modelling of literary texts,” in Proceedings of the Workshop on Computational Humanities Research (CHR 2020), 2020.
  2. K. E. Chu, P. Keikhosrokiani, and M. P. Asl, “A topic modeling and sentiment analysis model for detection and visualization of themes in literary texts,” Pertanika Journal of Science & Technology, vol. 30, no. 4, pp. 2535–2561, 2022, https://doi.org/10.47836/pjst.30.4.14.
  3. R. Stanković, C. Krstev, B. Š. Todorović, D. Vitas, M. Škorić, and M. I. Nešić, “Distant reading in digital humanities: Case study on the Serbian part of the ELTeC collection,” in Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 3337–3345. [Online]. Available: https://aclanthology.org/2022.lrec-1.356
  4. C. Schöch, T. Erjavec, R. Patras, and D. Santos, “Creating the European Literary Text Collection (ELTeC): Challenges and perspectives,” Modern Languages Open, 2021, http://doi.org/10.3828/mlo.v0i0.364.
  5. D. Medvecki, B. Bašaragin, A. Ljajić, and N. Milošević, “Multilingual transformer and BERTopic for short text topic modeling: The case of Serbian,” in Conference on Information Technology and its Applications. Springer, 2024, pp. 161–173, https://doi.org/10.1007/978-3-031-50755-7_16.
  6. C. Schöch, M. Hinzmann, J. Röttgermann, K. Dietz, and A. Klee, “Smart modelling for literary history,” International Journal of Humanities and Arts Computing, vol. 16, no. 1, pp. 78–93, 2022, https://doi.org/10.3366/ijhac.2022.0278.
  7. D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
  8. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” Advances in Neural Information Processing Systems, vol. 13, 2000. [Online]. Available: https://api.semanticscholar.org/CorpusID:2095855
  9. R. Egger and J. Yu, “A topic modeling comparison between LDA, NMF, Top2Vec, and BERTopic to demystify Twitter posts,” Frontiers in Sociology, vol. 7, p. 886498, 2022.
  10. R. Egger and J. Yu, “Identifying hidden semantic structures in Instagram data: A topic modelling comparison,” Tourism Review, vol. 77, no. 4, pp. 1234–1246, 2021.
  11. M. Švaňa, “Social media, topic modeling and sentiment analysis in municipal decision support,” in 2023 18th Conference on Computer Science and Intelligence Systems (FedCSIS). IEEE, 2023, pp. 1235–1239, http://dx.doi.org/10.15439/2023F1479.
  12. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017, https://doi.org/10.48550/arXiv.1706.03762.
  13. M. Grootendorst, “BERTopic: Neural topic modeling with a class-based TF-IDF procedure,” arXiv preprint arXiv:2203.05794, 2022, https://doi.org/10.48550/arXiv.2203.05794.
  14. N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” arXiv preprint arXiv:1908.10084, 2019, https://doi.org/10.48550/arXiv.1908.10084.
  15. M. Škorić, “Novi jezički modeli za srpski jezik [New language models for the Serbian language],” Infoteka, vol. 24, 2024, https://doi.org/10.48550/arXiv.2402.14379. [Online]. Available: https://arxiv.org/abs/2402.14379
  16. A. B. Dieng, F. J. Ruiz, and D. M. Blei, “Topic modeling in embedding spaces,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 439–453, 2020, https://doi.org/10.48550/arXiv.1907.04907.
  17. D. Newman, J. H. Lau, K. Grieser, and T. Baldwin, “Automatic evaluation of topic coherence,” in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2010, pp. 100–108.
  18. M. Röder, A. Both, and A. Hinneburg, “Exploring the space of topic coherence measures,” in Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 2015, pp. 399–408, https://doi.org/10.1145/2684822.2685324.
  19. N. Ljubešić and D. Lauc, “BERTić - the transformer language model for Bosnian, Croatian, Montenegrin and Serbian,” in Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing. Kyiv, Ukraine: Association for Computational Linguistics, Apr. 2021, pp. 37–42, https://doi.org/10.48550/arXiv.2104.09243. [Online]. Available: https://www.aclweb.org/anthology/2021.bsnlp-1.5