Comparison of Large Language Models Supporting the Polish Language in Terms of Faithfulness in Retrieval-Augmented Generation Applications
Marcin Blachnik, Jakub Chmielewski
DOI: http://dx.doi.org/10.15439/2025F6165
Citation: Proceedings of the 20th Conference on Computer Science and Intelligence Systems (FedCSIS), M. Bolanowski, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 43, pages 111–119 (2025)
Abstract. This article presents an evaluation of Large Language Models that support the Polish language, focusing on their ability to accurately extract detailed information embedded in the input text, a property referred to as Faithfulness. This scenario reflects a typical use case of Retrieval-Augmented Generation systems, where precise factual recall is critical. For this purpose, a modified needle-in-a-haystack test was conducted, in which all queries targeted numerical values concealed within extended textual contexts. The evaluation was based on recent reports from Poland's Central Statistical Office (GUS), ensuring that the content was not included in the training data of the evaluated models.
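The test procedure described in the abstract can be summarized with a short sketch. The Python fragment below is a minimal illustration of a modified needle-in-a-haystack trial under stated assumptions: `query_model` is a hypothetical stand-in for any LLM client, and the needle sentence, question, and numeric value are invented for illustration rather than taken from the GUS reports used in the study.

```python
import re

# Illustrative needle: a numeric fact in Polish. The values are
# invented for this sketch, not drawn from the paper's GUS data.
NEEDLE = "W 2024 roku liczba zarejestrowanych pojazdów wynosiła 84217."
QUESTION = "Ile pojazdów zarejestrowano w 2024 roku? Podaj samą liczbę."
EXPECTED = "84217"

def build_context(haystack: str, needle: str, depth: float) -> str:
    """Insert the needle sentence at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = haystack.split(". ")
    pos = int(len(sentences) * depth)
    return ". ".join(sentences[:pos] + [needle] + sentences[pos:])

def run_trial(query_model, haystack: str, depth: float) -> bool:
    """One trial: hide the number, ask for it, check exact recall."""
    context = build_context(haystack, NEEDLE, depth)
    prompt = f"Kontekst:\n{context}\n\nPytanie: {QUESTION}"
    # `query_model(prompt) -> str` is a hypothetical client callable.
    answer = query_model(prompt)
    # Compare digits only, so whitespace or thousands separators in
    # the model's answer do not produce false negatives.
    return EXPECTED in re.sub(r"[\s.,]", "", answer)
```

Sweeping `depth` over a grid and averaging `run_trial` results per model would yield a faithfulness score as a function of needle position, in the spirit of the evaluation the abstract describes.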