
Communication Papers of the 18th Conference on Computer Science and Intelligence Systems

Annals of Computer Science and Information Systems, Volume 37

The Grammar and Syntax Based Corpus Analysis Tool For The Ukrainian Language


DOI: http://dx.doi.org/10.15439/2023F7698

Citation: Communication Papers of the 18th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 37, pages 309–317


Abstract. This paper provides an overview of StyloMetrix, a corpus analysis tool for the Ukrainian language. StyloMetrix incorporates 104 metrics that cover grammatical, stylistic, and syntactic patterns.
