The Grammar and Syntax Based Corpus Analysis Tool For The Ukrainian Language
Daria Stetsenko, Inez Okulska
DOI: http://dx.doi.org/10.15439/2023F7698
Citation: Communication Papers of the 18th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 37, pages 309–317 (2023)
Abstract. This paper provides an overview of StyloMetrix for the Ukrainian language, a corpus analysis tool. StyloMetrix incorporates 104 metrics that cover grammatical, stylistic, and syntactic patterns.