Treating OCR Output as a Language (TOOL) – Improving OCR Output with Seq2Seq Translation
Thomas Asselborn, Magnus Bender, Ralf Möller, Sylvia Melzer
DOI: http://dx.doi.org/10.15439/2025F1103
Citation: Proceedings of the 20th Conference on Computer Science and Intelligence Systems (FedCSIS), M. Bolanowski, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 43, pages 471–478 (2025)
Abstract. Optical Character Recognition (OCR) systems are frequently used to digitize text, but they often produce noisy results, especially on historical, poor-quality, or multilingual data. Despite advances in OCR technology, post-processing remains a major bottleneck. We propose TOOL (Treating OCR Output as a Language), a new approach that frames OCR correction as a machine translation task. By treating noisy OCR text as a language in its own right, TOOL employs sequence-to-sequence models such as Marian to translate it into clean, standardized text. The method is scalable, model-independent, and language-flexible. We demonstrate the approach by translating "OCR German" into Standard German as written from around 1871 to the present day, improving token-level accuracy by training on matched pairs of OCR output and base text.
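As a rough illustration of the setup the abstract describes, the following sketch arranges matched pairs of OCR output and base text as a line-aligned parallel corpus and computes a simple token-level accuracy. It is not the authors' implementation: the function names `split_parallel` and `token_accuracy`, the whitespace tokenization, the alignment-based accuracy definition, and the example sentence are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def split_parallel(pairs):
    """Split (ocr_text, base_text) pairs into line-aligned source/target lists.

    The source side plays the role of the "OCR language", the target side
    the clean language, as in an ordinary translation corpus.
    """
    sources = [ocr.strip() for ocr, _ in pairs]
    targets = [base.strip() for _, base in pairs]
    return sources, targets


def token_accuracy(hypothesis, reference):
    """Fraction of reference tokens that the hypothesis reproduces in order.

    Assumption: tokens are whitespace-separated and are aligned with
    difflib; the paper may define token-level accuracy differently.
    """
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    if not ref_tokens:
        return 1.0 if not hyp_tokens else 0.0
    matcher = SequenceMatcher(a=hyp_tokens, b=ref_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_tokens)


# Made-up example pair (not taken from the paper's data).
pairs = [("Dle Prüfung flndet im März ftatt", "Die Prüfung findet im März statt")]
sources, targets = split_parallel(pairs)

# Accuracy of the raw OCR line against the base text: 3 of 6 tokens match.
print(token_accuracy(sources[0], targets[0]))  # 0.5
```

A trained sequence-to-sequence model would take each line of `sources` as input and be optimized to produce the corresponding line of `targets`; comparing `token_accuracy` before and after translation gives one simple way to quantify the improvement.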