Passage Retrieval of Polish Texts Using OKAPI BM25 and an Ensemble of Cross Encoders

Passage Retrieval has traditionally relied on lexical methods like TF-IDF and BM25. Recently, some neural network models have surpassed these methods in performance. However, these models face challenges, such as the need for large annotated datasets and difficulty adapting to new domains. This paper presents a winning solution to the Poleval 2023 Task 3: Passage Retrieval challenge, which involves retrieving passages of Polish texts in three domains: trivia, legal, and customer support; however, only the trivia domain was provided as training and development data. The method used the OKAPI BM25 algorithm to retrieve documents and an ensemble of publicly available multilingual Cross Encoders for reranking. Fine-tuning the reranker models slightly improved performance, but only in the training domain, while worsening it in the other domains.


I. INTRODUCTION
Passage retrieval is the task of retrieving a set of relevant text passages from a large collection of documents based on a given query. Typically, these passages are presented in descending order of relevance. The most commonly used methods for passage retrieval are lexical approaches like OKAPI BM25. However, lexical models cannot capture semantic relationships between words, phrases, and sentences. To address this, neural language models can be employed. These models are often pretrained on extensive text corpora and then fine-tuned specifically for passage retrieval. There are two common setups for utilizing neural models in this task: performing the complete passage retrieval with a neural model, or using another retrieval engine to retrieve a subset of passages and then applying the neural model to select the most relevant ones. The latter approach is employed when the reranking model is too slow to process the entire document collection.
The Poleval 2023 Task 3: Passage Retrieval challenge aims to identify the best method for passage retrieval in Polish texts. The competition's test dataset comprises three domains: wiki-trivia, legal-questions, and allegro-faq. However, only the wiki-trivia domain is provided as the training and development dataset.
In this paper, we discuss the two-stage approach that achieved a score of 69.36 NDCG@10 on the final test dataset of the competition. Our method involves two phases. First, we use the OKAPI BM25 algorithm to retrieve relevant passages. Then, an ensemble of Cross Encoder models is employed to rerank these passages. These models are publicly available multilingual models that have been trained on various languages (including Polish) and fine-tuned on multilingual corpora for passage reranking, as outlined in [1]. For two domains, legal-questions and allegro-faq, we used these models with no further fine-tuning on the challenge dataset. For the wiki-trivia domain, one model was fine-tuned and used in combination with models that had no further fine-tuning.

A. Reranker models and modern neural Information Retrieval
MS MARCO [2] is a large publicly available reranking dataset built from Bing search logs. The dataset includes queries, the documents retrieved by the search engine, and labels indicating whether a user clicked a document. The corpus is in English. Recently, the authors of mMARCO [1] translated this corpus into many languages (though not into Polish) and trained Cross Encoder reranker models on it. The base models were multilingual. The resulting models performed well not only on the languages covered by the translations but also on languages that the models had seen only during the self-supervised pretraining phase.
BEIR [3] is an Information Retrieval benchmark for zero-shot evaluation across different domains. Its authors provide extensive comparisons between different retrieval architectures. Very recently, a benchmark for Polish Information Retrieval was released in the BEIR-PL paper [4].

A. Data
The task is to retrieve the relevant passages for a given query. The queries and passages are in Polish. There are three separate domains: wiki-trivia, legal-questions, and allegro-faq, each described in a subsection below. There are the following datasets: training (train), development (dev), test-A (preliminary test set), and test-B (final test set). Gold-truth data for the training and development datasets was released during the competition, but the test set gold truth was not; after the competition, it was released at https://github.com/poleval/2022-passage-retrieval-secret. The training and development datasets consist only of wiki-trivia, while the test datasets cover all three domains. Some dataset statistics are given in Table I. The domains vary greatly in the number of passages and the mean number of relevant passages per query.
1) wiki-trivia: Questions are general-knowledge questions typical of TV quiz shows such as Fifteen to One or its Polish equivalent Jeden z dziesięciu. For each question, up to five relevant passages were manually selected (the mean number for the training dataset is 3.28, with a standard deviation of 1.45). The passage corpus consists of 7,097,322 elements. This domain was selected for the train, dev, and test datasets. There are 4041 questions in the train dataset, 599 in the dev dataset, 400 in the test-A dataset, and 891 in the test-B dataset. Below, one example question with all correct passages is presented.

B. Evaluation Metric
The GEval evaluation tool [11] uses Normalized Discounted Cumulative Gain for the top ten passages (NDCG@10) as the challenge metric. The challenge was hosted on the Gonito platform [12], and the final evaluation was conducted on the test-B dataset across all domains. It should be noted that the sample split between domains is not equal, which means that some domains have a greater impact on the final score.
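To make the metric concrete, the following minimal Python sketch computes NDCG@10 from a list of binary relevance labels given in predicted rank order. For simplicity it normalizes by the ideal ordering of that same list; the exact GEval implementation may differ, e.g., in how the ideal DCG is derived from the full gold set.

    import math

    def dcg_at_k(relevances, k=10):
        """Discounted Cumulative Gain over the top-k ranked passages."""
        return sum(rel / math.log2(rank + 2)  # rank is 0-indexed, hence +2
                   for rank, rel in enumerate(relevances[:k]))

    def ndcg_at_k(relevances, k=10):
        """NDCG@k: DCG of the predicted ranking divided by the ideal DCG."""
        ideal = dcg_at_k(sorted(relevances, reverse=True), k)
        return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

    # Binary relevance labels of the top ten returned passages for one query:
    print(ndcg_at_k([1, 0, 1, 0, 0, 0, 0, 0, 0, 0]))  # ≈ 0.92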

IV. METHOD
The solution involves two stages: retrieval and reranking. Retrieval is carried out using the lexical method OKAPI BM25, which is quick but not as effective as a neural ranking model; additionally, it does not require training (a minimal retrieval sketch follows below). The best-performing method for reranking is through Cross Encoders, but it is slow, as it requires processing every query-passage pair. Due to its time-consuming nature, it can only operate on a limited set of passages, except for the allegro-faq domain, which consists of only 921 passages.
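The paper does not specify which BM25 implementation was used; the sketch below illustrates the first stage with the rank_bm25 Python package and naive whitespace tokenization, both of which are assumptions for illustration only.

    from rank_bm25 import BM25Okapi  # pip install rank-bm25

    # Toy passage collection; the real corpora contain up to ~7.1M passages.
    passages = [
        "Wisła jest najdłuższą rzeką w Polsce.",
        "Fryderyk Chopin urodził się w Żelazowej Woli.",
    ]
    tokenized_corpus = [p.lower().split() for p in passages]  # naive tokenization
    bm25 = BM25Okapi(tokenized_corpus)

    query = "najdłuższa rzeka w polsce".split()
    # Stage 1: fetch the top-n candidates that the slower neural
    # reranker will score in stage 2 (n = 3000 for wiki-trivia).
    candidates = bm25.get_top_n(query, passages, n=2)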

B. Reranking phase
The reranking phase was performed using an ensemble of multilingual reranker models based on the Cross-Encoder architecture. We used different ensembles for the wiki-trivia domain and for the legal-questions and allegro-faq domains; both are described in the following section. The ensembles were formed by summing up the individual models' probability scores (see the sketch below). Fine-tuning, where performed, was loosely based on a script from the Sentence-Transformers library [14], namely https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_cross-encoder_scratch.py. Fine-tuning and inference were performed on an A100 GPU. We used 100 negative query-passage pairs for each positive passage selected from the training dataset. Negative passages were selected from the top 2000 passages returned by the OKAPI BM25 algorithm described above. The loss was BCEWithLogitsLoss, with a constant learning rate of 1e-6 and 2000 warmup steps. Training ran for ten epochs, and the best-performing model was selected for inference.
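As a concrete illustration of the ensembling step, the sketch below sums the probability scores of several Cross Encoders loaded via the Sentence-Transformers CrossEncoder class. The model names are examples of publicly available mMARCO rerankers, not the paper's exact per-domain ensembles.

    from sentence_transformers import CrossEncoder

    # Example multilingual mMARCO rerankers; the ensembles used in the
    # paper may differ from this list.
    MODEL_NAMES = [
        "cross-encoder/mmarco-mMiniLMv2-L12-H384-v1",
        "cross-encoder/mmarco-mdeberta-v3-base-5negs-v1",
    ]
    models = [CrossEncoder(name, max_length=512) for name in MODEL_NAMES]

    def ensemble_rerank(query, passages):
        """Rerank passages by the summed probability scores of all models."""
        pairs = [(query, p) for p in passages]
        totals = [0.0] * len(pairs)
        for model in models:
            scores = model.predict(pairs)  # sigmoid probability per pair
            totals = [t + float(s) for t, s in zip(totals, scores)]
        return sorted(zip(passages, totals), key=lambda x: x[1], reverse=True)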
1) wiki-trivia: Reranking was based on the top 3000 results from the OKAPI BM25 algorithm. Because wiki-trivia passages are relatively short, reranking them takes little time; however, during experiments we observed that reranking more than 1000 passages yields little additional gain in the metric score.
2) legal-questions and allegro-faq: Reranking was performed on the top 1500 passages for legal-questions. The limit was lower than for wiki-trivia because the passages in this collection are longer, which increases computation time. For the allegro-faq domain, reranking was performed on all the passages, since the whole collection consists of only 921 passages. The same ensemble was used for both domains. The following models were used without further fine-tuning on the competition dataset; we conducted experiments using models fine-tuned to wiki-trivia, but their performance dropped drastically. Finally, we used the following models:
• unicamp-dl/mt5-13b-mmarco-100k, available at https://huggingface.co/unicamp-dl/mt5-13b-mmarco-100k and described in the previous section (a scoring sketch for this style of seq2seq reranker is given after this list).
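The mt5-13b-mmarco-100k reranker is a sequence-to-sequence model rather than a classification-head Cross Encoder, so it scores query-passage pairs monoT5-style. The sketch below illustrates this with a smaller sibling checkpoint; the prompt format and the "yes"/"no" target tokens follow the common mMARCO mT5 convention and are assumptions here, not details taken from the paper.

    import torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # Smaller sibling of the 13B model used in the paper, for illustration.
    name = "unicamp-dl/mt5-base-mmarco-v2"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name).eval()

    def score(query: str, passage: str) -> float:
        # monoT5-style prompt; the "yes"/"no" target tokens are an assumed
        # convention for mMARCO mT5 rerankers, not confirmed by the paper.
        text = f"Query: {query} Document: {passage} Relevant:"
        inputs = tok(text, return_tensors="pt", truncation=True, max_length=512)
        decoder_input = torch.tensor([[model.config.decoder_start_token_id]])
        with torch.no_grad():
            logits = model(**inputs, decoder_input_ids=decoder_input).logits[0, -1]
        yes_id = tok.encode("yes", add_special_tokens=False)[0]
        no_id = tok.encode("no", add_special_tokens=False)[0]
        probs = torch.softmax(logits[[yes_id, no_id]], dim=0)
        return probs[0].item()  # probability of "yes" = relevance score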
We also translated the Polish passages and queries into English using a machine translation model available at https://huggingface.co/gsarti/opus-mt-tc-en-pl [16]. However, English Cross Encoder reranking models applied to the translated texts did not perform better than multilingual reranking models applied to the original Polish texts.
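For reference, translation of this kind can be run with the Hugging Face transformers pipeline. The sketch below uses the Helsinki-NLP/opus-mt-pl-en checkpoint as an illustrative Polish-to-English model; the paper cites the gsarti checkpoint linked above.

    from transformers import pipeline

    # Marian OPUS-MT translation; Helsinki-NLP/opus-mt-pl-en is an
    # illustrative Polish-to-English checkpoint, not the paper's exact model.
    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-pl-en")

    passages_pl = ["Wisła jest najdłuższą rzeką w Polsce."]
    passages_en = [out["translation_text"] for out in translator(passages_pl)]
    print(passages_en)  # e.g. ["The Vistula is the longest river in Poland."]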

B. Bi Encoder models
We experimented with various publicly available Bi Encoder models, using them as one-stage retrieval models. Unfortunately, their performance was significantly inferior to that of the OKAPI BM25 algorithm operating alone. However, combining the OKAPI BM25 and Bi Encoder models as retrieval models for further reranking with a Cross Encoder model may lead to improved results and is a promising area for research. Our highest Bi Encoder score for untranslated documents was 9.26 NDCG@10, achieved using the sentence-transformers/distiluse-base-multilingual-cased-v1 model. For texts translated into English, our highest score was 21.00 NDCG@10, obtained using the sentence-transformers/all-mpnet-base-v2 model.
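A one-stage Bi Encoder setup of the kind evaluated here can be sketched as follows, using the best-scoring multilingual model named above; the corpus and query are illustrative toy data.

    from sentence_transformers import SentenceTransformer, util

    # The multilingual Bi Encoder that scored best on untranslated documents.
    model = SentenceTransformer(
        "sentence-transformers/distiluse-base-multilingual-cased-v1")

    passages = [
        "Wisła jest najdłuższą rzeką w Polsce.",
        "Fryderyk Chopin urodził się w Żelazowej Woli.",
    ]
    corpus_emb = model.encode(passages, convert_to_tensor=True)  # precompute once

    query_emb = model.encode("najdłuższa rzeka w Polsce", convert_to_tensor=True)
    # One-stage dense retrieval: rank passages by cosine similarity.
    hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
    for hit in hits:
        print(passages[hit["corpus_id"]], round(hit["score"], 3))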

C. Translating MS MARCO into Polish
mMARCO does not include translations for Polish texts. We attempted to translate MS MARCO into Polish using the model gsarti/opus-mt-tc-en-pl and to train several reranking models on this data. The approach is similar to [4]; that work, however, was published after the competition. In our case, this approach did not yield better results than the large multilingual models.

VII. CONCLUSIONS
This paper summarizes our solution to Poleval 2023 Task 3: Passage Retrieval. The system operates in two stages, utilizing OKAPI BM25 for retrieval and a multilingual ensemble of Cross Encoders for reranking. However, the system's performance varies between domains due to the limited availability of training data for only one domain. While fine-tuning the neural model can enhance results for this domain, it may have a negative impact on other domains.

TABLE II
NDCG@10 RESULTS FOR THE FINAL TEST DATASET (TEST-B) AND THE PRELIMINARY TEST DATASET (TEST-A), OVERALL AND SPLIT INTO DOMAINS. FT STANDS FOR MODEL FINE-TUNING ON THE COMPETITION DATA, WHEREAS NO-FT STANDS FOR NO FINE-TUNING. THE NUMBER TO THE RIGHT OF THE MODEL NAME IS THE RERANKING SIZE TAKEN FROM THE OKAPI BM25 ALGORITHM. EXPERIMENTS THAT WERE NOT CONDUCTED OR SAVED ARE LABELED "-".