Communication Papers of the 19th Conference on Computer Science and Intelligence Systems (FedCSIS)

Annals of Computer Science and Information Systems, Volume 41

Semi-automatic annotation of Greek majuscule manuscripts: Steps towards integrated transcription and annotation

DOI: http://dx.doi.org/10.15439/2024F1772

Citation: Communication Papers of the 19th Conference on Computer Science and Intelligence Systems (FedCSIS), M. Bolanowski, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 41, pages 3744

Abstract. We present a prototype that integrates handwritten text recognition (HTR) with semi-automated markup of textual features in the eScriptorium GUI.
