Converting German Historical Legal Documents to TEI XML including challenges with Table Extraction
Thomas Reiser, Petra Steiner
DOI: http://dx.doi.org/10.15439/2024F8851
Citation: Communication Papers of the 19th Conference on Computer Science and Intelligence Systems (FedCSIS), M. Bolanowski, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 41, pages 139–149 (2024)
Abstract. The job archive at the Federal Institute for Vocational Education and Training contains thousands of historical German VET and CVET regulations from the last 100 years. However, these are hardly accessible because they are currently available only in their original paper form. We present a workflow that transcribes images of these regulations into the TEI XML format, which preserves the logical document structure and stores metadata. TEI XML is widely used for digital archives, and the conversion represents an important step towards a fully digitalized archive. This paper addresses issues caused by poor page segmentation in the applied OCR methods and presents rules that can reconstruct a large part of the documents' hierarchy. A straightforward recognition method for tables with borders is presented, as well as a metadata extraction procedure for the selected data set. While our approach is generic and functional, further research is necessary to develop a fully automated workflow.