Proceedings of the 18th Conference on Computer Science and Intelligence Systems

Annals of Computer Science and Information Systems, Volume 35

Passage Retrieval of Polish Texts Using OKAPI BM25 and an Ensemble of Cross Encoders

DOI: http://dx.doi.org/10.15439/2023F9253

Citation: Proceedings of the 18th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 35, pages 1265–1269 (2023)

Abstract. Passage retrieval has traditionally relied on lexical methods such as TF-IDF and BM25. Recently, neural network models have surpassed these methods in performance. However, such models face challenges, including the need for large annotated datasets and difficulty adapting to new domains. This paper presents the winning solution to the PolEval 2023 Task 3: Passage Retrieval challenge, which involves retrieving passages of Polish texts in three domains: trivia, legal, and customer support. Only the trivia domain, however, was covered by the training and development data. The method used the Okapi BM25 algorithm to retrieve documents and an ensemble of publicly available multilingual cross encoders for reranking. Fine-tuning the reranker models slightly improved performance in the training domain but degraded it in the other domains.
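For reference, the first retrieval stage scores each passage with the standard Okapi BM25 function; the abstract does not state parameter settings, so $k_1$ and $b$ below are left as the usual free parameters, and the IDF shown is one common variant:

\[
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\,\frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(q_i) = \ln\!\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)
\]

where $f(q_i, D)$ is the frequency of query term $q_i$ in passage $D$, $|D|$ and $\mathrm{avgdl}$ are the passage length and the average passage length, $N$ is the number of passages, and $n(q_i)$ is the number of passages containing $q_i$.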
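A minimal sketch of the two-stage pipeline the abstract describes, using the rank_bm25 and sentence-transformers libraries; the checkpoint names, candidate-pool size, and plain score averaging are illustrative assumptions, not the authors' exact configuration:

    # Two-stage retrieval sketch: BM25 candidate generation followed by
    # reranking with an averaged ensemble of multilingual cross encoders.
    # Checkpoints and pool size are assumptions for illustration.
    import numpy as np
    from rank_bm25 import BM25Okapi
    from sentence_transformers import CrossEncoder

    passages = ["..."]          # corpus of Polish passages
    query = "..."               # Polish question

    # Stage 1: lexical retrieval with Okapi BM25 over lowercased
    # whitespace tokens (the original solution's Polish-specific
    # preprocessing, e.g. lemmatization, is omitted here).
    tokenized = [p.lower().split() for p in passages]
    bm25 = BM25Okapi(tokenized)
    bm25_scores = bm25.get_scores(query.lower().split())
    pool = np.argsort(bm25_scores)[::-1][:100]   # top-100 candidates

    # Stage 2: rerank the candidate pool with an ensemble of publicly
    # available multilingual cross encoders (example mMARCO-trained
    # checkpoints) by averaging their relevance scores per pair.
    rerankers = [CrossEncoder(name) for name in (
        "cross-encoder/mmarco-mMiniLMv2-L12-H384-v1",
        "cross-encoder/mmarco-mdeberta-v3-base-5negs-v1",
    )]
    pairs = [(query, passages[i]) for i in pool]
    ensemble = np.mean([m.predict(pairs) for m in rerankers], axis=0)
    ranking = [passages[i] for i in pool[np.argsort(ensemble)[::-1]]]

Since different cross encoders can emit scores on different scales, a per-query normalization (e.g. min-max) before averaging is a common variant of this ensembling step.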
