Automated creation of parallel Bible corpora with cross-lingual semantic concordance

Jens Dörpinghaus; Carsten Düing

DOI: http://dx.doi.org/10.15439/2021F30

Citation: Proceedings of the 16th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 25, pages 111–114 (2021)

Abstract. We present a novel approach to the automated creation of parallel New Testament corpora with a cross-lingual semantic concordance based on Strong's numbers. Few digital Biblical resources are available to scholars. To address this gap, we present two approaches, a dictionary-based approach and a conditional random field (CRF) model, together with a detailed evaluation on annotated and non-annotated translations. We discuss a proof of concept based on English and German New Testament translations. To our knowledge, the results presented in this paper are novel and unique. They show promising performance, although further research is necessary.
