Enhancing Text Recognition of Damaged Documents through Synergistic OCR and Large Language Models
Thomas Asselborn, Jens Dörpinghaus, Faraz Kausar, Ralf Möller, Sylvia Melzer
DOI: http://dx.doi.org/10.15439/2024F7400
Citation: Communication Papers of the 19th Conference on Computer Science and Intelligence Systems (FedCSIS), M. Bolanowski, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 41, pages 29–36 (2024)
Abstract. Optical Character Recognition (OCR) remains a highly relevant area of research in pattern recognition. Its applications span various domains, including supporting reading for the visually impaired, interpreting Morse code, capturing postal addresses, evaluating emails, scanning price tags and passports, and extracting text from digitised documents. As the volume of digitised data continues to grow, challenges arise in capturing the semantic structure of documents through logical structure analysis and in providing data suitable for information retrieval to answer specific research questions. While classic OCR tools such as Tesseract and OCRopus work well for contemporary digitised documents, there is room for improvement in text and word recognition for severely damaged historical documents. Large Language Models (LLMs) such as GPT-4 can be used effectively for text recognition tasks: their advanced natural language processing capabilities allow them to interpret and reconstruct unclear or damaged text, offering potential for improving the overall text recognition process. However, additional challenges arise when documents contain, for example, a mixture of single-column and double-column text, a mixture of images and text, or words that are unknown to or blocked by the agents.
