Benchmarking OpenAI's APIs and Large Language Models for Repeatable, Efficient Question Answering Across Multiple Documents
Elena Filipovska, Ana Mladenovska, Merxhan Bajrami, Jovana Dobreva, Vellislava Hillman, Petre Lameski, Eftim Zdravevski
DOI: http://dx.doi.org/10.15439/2024F3979
Citation: Proceedings of the 19th Conference on Computer Science and Intelligence Systems (FedCSIS), M. Bolanowski, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 39, pages 107–117 (2024)
Abstract. The rapid growth in the volume and complexity of documents across many domains calls for advanced automated methods that improve the efficiency and accuracy of information extraction and analysis. This paper evaluates the efficiency and repeatability of OpenAI's APIs and other Large Language Models (LLMs) in automating question-answering tasks across multiple documents, focusing on the analysis of Data Privacy Policy (DPP) documents of selected EdTech providers. We test how well these models perform on large-scale text processing tasks using OpenAI's models (GPT-3.5 Turbo, GPT-4, GPT-4o) and APIs in several frameworks: direct API calls (i.e., one-shot learning), LangChain, and Retrieval Augmented Generation (RAG) systems. We also evaluate a local deployment of a quantized LLM (Llama-2-13B-chat-GPTQ) combined with FAISS. Through systematic evaluation against predefined use cases and a range of metrics, including response format, execution time, and cost, our study provides insights into optimal practices for document analysis. Our findings demonstrate that calling OpenAI's LLMs via API is a practical way to accelerate document analysis, particularly for long texts, when a local GPU-powered infrastructure is not a viable option. Local deployment, on the other hand, is valuable for keeping data within private infrastructure. We further show that the quantized models remain highly relevant despite having fewer parameters than ChatGPT, and that they impose no processing restrictions on the number of tokens. Beyond confirming the usefulness of LLMs for improving document analysis procedures, this study offers guidance on maximizing their efficiency and supporting data governance.
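The RAG workflow named in the abstract boils down to a retrieve-then-prompt loop: split the policy documents into chunks, rank chunks by similarity to the question, and assemble the top matches into a prompt for the LLM. The following is an illustrative Python sketch only, not the paper's implementation: the toy bag-of-words similarity stands in for a real embedding model and a FAISS index, the sample policy snippets are invented, and the resulting prompt would be sent to an OpenAI chat model rather than printed.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real pipeline would use a dense
    # embedding model and store the vectors in a FAISS index.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(chunks: list[str], question: str, k: int = 2) -> list[str]:
    # Rank document chunks by similarity to the question, keep the top k.
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)[:k]


def build_prompt(chunks: list[str], question: str) -> str:
    # Assemble the retrieved context and the question into one prompt,
    # which would then be passed to the chat model.
    context = "\n".join(retrieve(chunks, question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical DPP snippets, stand-ins for real policy chunks.
chunks = [
    "Student data is retained for 12 months after account closure.",
    "The provider may share anonymized usage statistics with partners.",
    "Support is available on weekdays between 9am and 5pm.",
]
prompt = build_prompt(chunks, "How long is student data retained?")
print(prompt)
```

The design choice that matters here is the split between retrieval and generation: only the chunks most relevant to the question enter the prompt, which keeps long documents within the model's token limit and reduces cost per query.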