Opportunities and Challenges of LLMs as Post-OCR Correctors
Radoslav Koynov, Triet Ho Anh Doan
DOI: http://dx.doi.org/10.15439/2025F4697
Citation: Communication Papers of the 20th Conference on Computer Science and Intelligence Systems (FedCSIS), M. Bolanowski, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 45, pages 111–118 (2025)
Abstract. Large Language Models (LLMs) have shown potential as zero-shot Post-OCR correctors for historical texts. However, previous research has typically focused on a single dataset and evaluated only Character Error Rate (CER) or Word Error Rate (WER). This study investigates both the potential of LLMs to improve the accuracy of Optical Character Recognition (OCR) output and the limitations of the models. To this end, we evaluate the approach on several German and English historical datasets and analyze in depth the model corrections and their deviations from the ground truth. We demonstrate that LLMs can improve the quality of OCR results as zero-shot correctors in some cases, and that fine-tuning LLMs shows promise as part of an LLM-based Post-OCR correction system, provided that certain risks are carefully mitigated.
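For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how CER and WER are commonly computed: the Levenshtein edit distance between the OCR output and the ground truth, divided by the length of the ground truth in characters or words. The function names `edit_distance`, `cer`, and `wer` are illustrative assumptions, as are the whitespace tokenization and the absence of any text normalization; the tooling and normalization used in the study itself may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def cer(ground_truth, ocr_text):
    """Character Error Rate: character edits per ground-truth character."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_text) / len(ground_truth)


def wer(ground_truth, ocr_text):
    """Word Error Rate: word edits per ground-truth word (whitespace tokens)."""
    ref_words = ground_truth.split()
    if not ref_words:
        raise ValueError("ground truth must contain at least one word")
    return edit_distance(ref_words, ocr_text.split()) / len(ref_words)


if __name__ == "__main__":
    truth = "Die Sonne scheint"
    ocr = "Dic Sonne scheint"  # one misrecognized character
    print(f"CER = {cer(truth, ocr):.4f}")
    print(f"WER = {wer(truth, ocr):.4f}")
```

In this example a single wrong character costs 1 of 17 characters but 1 of 3 words, so the two rates can diverge sharply on the same text, and neither says what kind of change a corrector made.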