Communication Papers of the 20th Conference on Computer Science and Intelligence Systems (FedCSIS)

Annals of Computer Science and Information Systems, Volume 45

Towards a German VET Archive and its Integration into a Data Warehouse

DOI: http://dx.doi.org/10.15439/2025F0064

Citation: Communication Papers of the 20th Conference on Computer Science and Intelligence Systems (FedCSIS), M. Bolanowski, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 45, pages 145–152

Abstract. This paper presents a systematic evaluation and prototypical implementation of an information system for historical vocational education and training (VET) regulations in Germany. The focus of this study is on integrating structured outputs with the German Labor Market Ontology (GLMO) and a broader labor market data warehouse. A corpus of VET and CVET regulations, as published in the Federal Gazette from 1969 to 2022, was used to assess the functional and semantic requirements of the archival process. This analysis was complemented by a review of existing software frameworks, culminating in the proposal of a combined architecture utilizing Omeka S and TEI Publisher. In addition, the necessary transformations, metadata enrichment, and ETL processes required to integrate the resulting TEI XML documents into a semantically linked data environment are detailed. This work provides a concrete roadmap for the sustainable digitization and semantic integration of regulatory texts into modern labor market intelligence infrastructures.
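The ETL step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the sample TEI document, the base URI `https://example.org/vet/`, the identifier `reg-0001`, and the choice of Dublin Core terms as predicates are all assumptions made for illustration, and the actual GLMO vocabulary is not shown in this abstract. The sketch extracts header metadata from a TEI XML document and emits N-Triples lines that a triple store or data warehouse loader could ingest.

```python
import xml.etree.ElementTree as ET

# The TEI namespace is standard; everything else below is illustrative.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
DCT = "http://purl.org/dc/terms/"   # Dublin Core terms, used here as stand-in predicates
BASE = "https://example.org/vet/"   # hypothetical base URI, not a GLMO identifier

# Invented sample document; title and date are placeholders, not corpus data.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample training regulation</title></titleStmt>
      <publicationStmt>
        <publisher>Federal Gazette</publisher>
        <date when="1969-01-01">1 January 1969</date>
      </publicationStmt>
      <sourceDesc><p>Placeholder source description.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><div type="section" n="1"><p>Sample text.</p></div></body></text>
</TEI>"""


def extract(tei_xml: str) -> dict:
    """Extract: read title, publication date, and section count from TEI."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    date_el = root.find(".//tei:publicationStmt/tei:date", TEI_NS)
    issued = date_el.get("when", "") if date_el is not None else ""
    sections = root.findall(".//tei:div[@type='section']", TEI_NS)
    return {"title": title.strip(), "issued": issued, "sections": len(sections)}


def literal(value: str) -> str:
    """Quote a string as an N-Triples literal (backslash and quote escaping only)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(doc_id: str, record: dict) -> list:
    """Transform: map the extracted record to N-Triples statements."""
    subject = f"<{BASE}{doc_id}>"
    return [
        f"{subject} <{DCT}title> {literal(record['title'])} .",
        f"{subject} <{DCT}issued> {literal(record['issued'])} .",
    ]


if __name__ == "__main__":
    # Load: printing stands in for writing to a triple store or warehouse table.
    for line in to_ntriples("reg-0001", extract(SAMPLE_TEI)):
        print(line)
```

In a real setting, the predicates would be replaced by the ontology's own properties and the output passed to whatever loader the warehouse provides; the three-function split only mirrors the extract, transform, and load stages.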
