Proceedings of the 18th Conference on Computer Science and Intelligence Systems

Annals of Computer Science and Information Systems, Volume 35

EpiDoc Data Matching for Federated Information Retrieval in the Humanities

DOI: http://dx.doi.org/10.15439/2023F1515

Citation: Proceedings of the 18th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 35, pages 1069–1074

Abstract. Federated information retrieval (FIR) is becoming increasingly important in humanities research. Unlike traditional centralised information retrieval, where searches run over a logically centralised collection of documents, FIR treats each information system as an independent source with its own characteristics. Searching these systems together as if they were a single centralised source lowers precision in humanities research, even when the research data is structured and stored according to standardised guidelines such as EpiDoc. It also requires that the origin of each record remain traceable, so that incorrect historical conclusions are avoided. Matching queries against all data sets in each source is proving less effective. A global search index that enables traceable matching of the key values deemed relevant would provide a more robust solution. In this paper, we propose a novel EpiDoc data matching procedure that facilitates traceable FIR across distinct epigraphic sources.
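To make the idea of a global search index with traceable key values more concrete, the following minimal Python sketch may help. It is not the matching procedure proposed in the paper. It assumes that key values have already been extracted from each source into plain dictionaries, and it matches only on a simple normalised form (lowercased, whitespace collapsed). The source names, record identifiers, field names, and values are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so equivalent key values compare equal."""
    return " ".join(value.lower().split())


def build_global_index(sources):
    """Map each normalised key value to the (source, record id, field) triples it came from."""
    index = defaultdict(list)
    for source_name, records in sources.items():
        for record_id, fields in records.items():
            for field, value in fields.items():
                index[normalize(value)].append((source_name, record_id, field))
    return index


def search(index, query):
    """Return the provenance triples for a query, or an empty list if nothing matches."""
    return index.get(normalize(query), [])


# Hypothetical sources and records, invented for illustration only.
sources = {
    "source_a": {"a1": {"origPlace": "Ephesos", "material": "Marble"}},
    "source_b": {"b7": {"origPlace": "ephesos ", "material": "limestone"}},
}

index = build_global_index(sources)
hits = search(index, "Ephesos")
print(hits)
```

Because every index entry carries its source and record identifier, each hit can be traced back to the system it came from, which is the property the abstract argues a merged, centralised collection does not guarantee.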
