AraXLM: Evaluating Arabic Diacritization Tools for Cross-Language Plagiarism Detection
Mona Alshehri, Natalia Beloff, Martin White
DOI: http://dx.doi.org/10.15439/2025F4862
Citation: Proceedings of the 20th Conference on Computer Science and Intelligence Systems (FedCSIS), M. Bolanowski, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 43, pages 451–460 (2025)
Abstract. In recent years, plagiarism detection systems have evolved from basic lexical matching and n-gram overlap methods to Deep Learning (DL) models capable of capturing semantic relationships between texts. While these DL-based approaches have achieved notable success across various languages, their effectiveness in Arabic remains limited by inherent linguistic ambiguities, particularly the omission of diacritical marks. This omission hinders accurate semantic interpretation and limits the ability of models to detect paraphrased or semantically obfuscated content in Arabic texts. This paper presents an evaluation of state-of-the-art Arabic Text Diacritization (ATD) tools, which forms the first stage of a plagiarism detection framework designed for Arabic-English cross-lingual model text analysis (AraXLM). An empirical analysis was conducted on six ATD models using Word Error Rate (WER) and Diacritic Error Rate (DER), each with and without case endings (CE), together with the Bilingual Evaluation Understudy (BLEU) metric. The results show that tools such as Shakkelha produce lower DER and high BLEU values, indicating high accuracy in diacritic restoration, while Fine-Tashkeel demonstrates the lowest WER and highest BLEU, reflecting the best word-level performance. In contrast, CAMeL Tools and Mishkal display comparatively higher error rates across both metrics. These findings suggest that incorporating accurate diacritization models into Arabic NLP tasks, such as Machine Translation (MT) and Plagiarism Detection (PD), improves text normalisation and the quality of semantic embeddings. Thus, the AraXLM framework, supported by effective diacritization pre-processing, enhances linguistically aware detection of plagiarism involving Arabic text, where precise semantic alignment between languages is essential.
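As a rough illustration of the DER and WER metrics named in the abstract, the following Python sketch uses the definitions commonly found in the diacritization literature: DER is the share of characters whose diacritics differ from the reference, and WER is the share of words containing at least one such character. The function names, the choice to count every non-diacritic character, and the treatment of "without case endings" as skipping the last letter of each word are assumptions made for this sketch and may differ from the evaluation protocol used in the paper.

```python
# Arabic diacritic code points U+064B..U+0652 (tanween, short vowels, shadda, sukun).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def split_word(word):
    """Return a list of (base_char, diacritics) pairs for one word."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # attach the mark to the preceding base character
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def error_rates(reference, hypothesis, include_case_endings=True):
    """Return (DER, WER) for two diacritized versions of the same text.

    Assumption: both strings contain the same words and base letters,
    and differ only in their diacritics.
    """
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("reference and hypothesis differ in word count")

    char_total = char_errors = word_errors = 0
    for ref_word, hyp_word in zip(ref_words, hyp_words):
        ref_pairs, hyp_pairs = split_word(ref_word), split_word(hyp_word)
        if [b for b, _ in ref_pairs] != [b for b, _ in hyp_pairs]:
            raise ValueError("base letters differ between the two texts")
        if not include_case_endings:
            # Hypothetical simplification: ignore the final letter of each word.
            ref_pairs, hyp_pairs = ref_pairs[:-1], hyp_pairs[:-1]
        # Sort marks so that "shadda + fatha" equals "fatha + shadda".
        errors = sum(
            sorted(r[1]) != sorted(h[1]) for r, h in zip(ref_pairs, hyp_pairs)
        )
        char_total += len(ref_pairs)
        char_errors += errors
        word_errors += errors > 0

    der = char_errors / char_total if char_total else 0.0
    wer = word_errors / len(ref_words) if ref_words else 0.0
    return der, wer


# "kataba" (reference) vs. "katabu" (hypothesis): only the final vowel differs.
ref = "\u0643\u064E\u062A\u064E\u0628\u064E"
hyp = "\u0643\u064E\u062A\u064E\u0628\u064F"
print(error_rates(ref, hyp))                              # DER = 1/3, WER = 1.0
print(error_rates(ref, hyp, include_case_endings=False))  # DER = 0.0, WER = 0.0
```

The single mismatch sits on the final letter, so it counts when case endings are included and disappears when they are excluded, which is why the two variants are typically reported side by side.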