
Proceedings of the 18th Conference on Computer Science and Intelligence Systems

Annals of Computer Science and Information Systems, Volume 35

Hybrid retrievers with generative re-rankers

DOI: http://dx.doi.org/10.15439/2023F8119

Citation: Proceedings of the 18th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 35, pages 1271–1276 (2023)


Abstract. The passage retrieval task was announced during PolEval 2022, a SemEval-inspired evaluation campaign for natural language processing tools for Polish. Passage retrieval is a crucial component of modern open-domain question answering systems, which rely on precise and efficient retrieval to identify passages that contain correct answers. Our solution to this task is a multi-stage neural information retrieval system. The first stage retrieves candidate passages using federated search over a sparse index (BM25) and two dense FAISS indexes built with bi-encoder retrievers based on Polish RoBERTa models. The second stage re-ranks the retrieved passages with a neural model, mt5-13b-mmarco, which scores each passage by its relevance to the query; the highest-scoring passages are retained as the final result. Our system achieved second place in the competition.
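As a rough illustration of the first stage, the sketch below federates a BM25 index with a dense FAISS index and merges the two rankings with reciprocal rank fusion (RRF). This is a minimal sketch under stated assumptions: the multilingual bi-encoder checkpoint stands in for the Polish RoBERTa bi-encoders, the toy corpus stands in for the competition passages, and RRF is one common fusion rule; the abstract does not specify how the federated results are merged.

```python
# Stage 1 sketch: federated candidate retrieval over a sparse BM25 index and a
# dense FAISS index, merged with reciprocal rank fusion (RRF). Model choice,
# fusion rule, and corpus are illustrative assumptions, not the paper's setup.
import faiss
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

passages = [
    "Warsaw is the capital and largest city of Poland.",
    "BM25 is a bag-of-words ranking function used by search engines.",
    "FAISS performs efficient similarity search over dense vectors.",
]

# Sparse index: BM25 over whitespace-tokenized passages.
bm25 = BM25Okapi([p.lower().split() for p in passages])

# Dense index: bi-encoder embeddings in a flat inner-product FAISS index
# (a multilingual stand-in for the Polish RoBERTa bi-encoders).
encoder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
emb = np.asarray(encoder.encode(passages, normalize_embeddings=True), dtype="float32")
dense_index = faiss.IndexFlatIP(emb.shape[1])
dense_index.add(emb)

def retrieve_candidates(query: str, k: int = 3, rrf_k: int = 60) -> list[int]:
    """Merge the BM25 and dense rankings with reciprocal rank fusion."""
    sparse_rank = np.argsort(-bm25.get_scores(query.lower().split()))
    q_emb = np.asarray(encoder.encode([query], normalize_embeddings=True), dtype="float32")
    _, dense_rank = dense_index.search(q_emb, len(passages))
    fused: dict[int, float] = {}
    for ranking in (sparse_rank, dense_rank[0]):
        for rank, idx in enumerate(ranking):
            fused[int(idx)] = fused.get(int(idx), 0.0) + 1.0 / (rrf_k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)[:k]

print(retrieve_candidates("capital of Poland"))  # -> [0, ...] on this toy corpus
```

In a deployed system the dense indexes would be built offline over the full passage collection, and the flat index would typically be replaced by an approximate FAISS index (e.g. IVF or HNSW) for speed.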
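The second stage can be sketched in the monoT5 style of sequence-to-sequence re-ranking: the model reads "Query: {q} Document: {d} Relevant:" and the passage score is the probability of the positive target token at the first decoding step. The Hugging Face model id and the "yes"/"no" target tokens below are assumptions about the mt5-13b-mmarco checkpoint (prompt templates and target tokens vary across checkpoints), so verify them before use; a smaller mMARCO mT5 reranker works the same way.

```python
# Stage 2 sketch: monoT5-style re-ranking with a seq2seq model. The model id
# and the "yes"/"no" target tokens are assumptions about the mt5-13b-mmarco
# checkpoint; check the model card for the exact prompt and targets.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "unicamp-dl/mt5-13b-mmarco-100k"  # assumed id; 13B needs serious hardware
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME).eval()

# First subword ids of the assumed positive/negative target words.
YES = tokenizer.encode("yes", add_special_tokens=False)[0]
NO = tokenizer.encode("no", add_special_tokens=False)[0]

@torch.no_grad()
def rerank(query: str, candidates: list[str]) -> list[tuple[float, str]]:
    """Return (score, passage) pairs sorted by P("yes") at the first decode step."""
    scored = []
    for passage in candidates:
        prompt = f"Query: {query} Document: {passage} Relevant:"
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
        out = model.generate(**inputs, max_new_tokens=1,
                             output_scores=True, return_dict_in_generate=True)
        logits = out.scores[0][0]                      # vocabulary logits at step 1
        p_yes = torch.softmax(logits[[YES, NO]], dim=0)[0].item()
        scored.append((p_yes, passage))
    return sorted(scored, reverse=True)
```

The highest-scoring passages from this step form the final result, mirroring the abstract's description of the second stage.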
