Polish Information Processing Society

Annals of Computer Science and Information Systems, Volume 11

Proceedings of the 2017 Federated Conference on Computer Science and Information Systems

Big Data Language Model of Contemporary Polish


DOI: http://dx.doi.org/10.15439/2017F432

Citation: Proceedings of the 2017 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 11, pages 389-395


Abstract. We provide 5-gram language models of contemporary Polish trained on big data: the Common Crawl corpus (a compilation of more than 9,000,000,000 pages from across the web) and other resources. We show that our model outperforms the Google WEB1T n-gram counts, delivering better quality in terms of both perplexity and machine translation. The model retains low-count entries and is de-duplicated to reduce boilerplate. We also release the raw corpus, a POS-tagged version of it, and a dictionary of contemporary Polish. The language models were built with Kneser-Ney smoothing in the SRILM toolkit while keeping singletons. We describe in detail how the corpus was obtained and pre-processed, with emphasis on the issues that surface when working with data at this scale. Finally, we train the language model and report the improvements in perplexity and in BLEU score for machine translation achieved with our model.
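The pipeline outlined in the abstract can be illustrated with a short sketch. The following is a minimal illustration rather than the authors' actual tooling: it assumes the SRILM ngram-count and ngram binaries are on the PATH, uses the langdetect port cited below as reference 15 for language filtering, and all file names are hypothetical. The -gtNmin 1 options disable SRILM's default low-count cutoffs so that singleton n-grams are retained, matching the "keeping singletons" choice described in the abstract.

    # A minimal sketch, not the authors' pipeline: filter the crawl to Polish,
    # drop duplicate lines (boilerplate), then build and evaluate a 5-gram
    # model with SRILM. File names are hypothetical; SRILM must be on PATH.
    import subprocess

    from langdetect import detect, LangDetectException  # reference 15

    def filter_and_deduplicate(in_path, out_path):
        """Keep only Polish lines and drop exact duplicates (boilerplate)."""
        seen = set()
        with open(in_path, encoding="utf-8") as src, \
             open(out_path, "w", encoding="utf-8") as dst:
            for line in src:
                line = line.strip()
                if not line or hash(line) in seen:
                    continue
                try:
                    if detect(line) != "pl":
                        continue
                except LangDetectException:
                    continue  # line too short or ambiguous to classify
                seen.add(hash(line))
                dst.write(line + "\n")

    filter_and_deduplicate("commoncrawl_raw.txt", "commoncrawl_pl.txt")

    # Train a 5-gram model with interpolated Kneser-Ney smoothing; the
    # -gtNmin 1 flags keep singleton n-grams instead of pruning them.
    subprocess.run(["ngram-count", "-order", "5",
                    "-kndiscount", "-interpolate", "-unk",
                    "-gt3min", "1", "-gt4min", "1", "-gt5min", "1",
                    "-text", "commoncrawl_pl.txt", "-lm", "pl_5gram.lm.gz"],
                   check=True)

    # Report the perplexity of the trained model on a held-out test set.
    subprocess.run(["ngram", "-order", "5", "-unk",
                    "-lm", "pl_5gram.lm.gz", "-ppl", "test_pl.txt"],
                   check=True)

Keeping singletons greatly enlarges the model, but at this scale even once-seen n-grams contribute useful probability mass. A real run over billions of pages would also need to shard the deduplication step rather than hold a single in-memory set as above.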

References

  1. Brants, T., Popat, A. C., Xu, P., Och, F. J., & Dean, J. (2007). Large language models in machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.
  2. Guthrie, D., & Hepple, M. (2010, October). Storing the web in memory: Space efficient language models with constant time retrieval. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (pp. 262-272). Association for Computational Linguistics.
  3. Chelba, C., & Schalkwyk, J. (2013). Empirical exploration of language modeling for the google.com query stream as applied to mobile voice search. In Mobile Speech and Advanced Natural Language Solutions (pp. 197-229). Springer New York, http://dx.doi.org/10.1007/978-1-4614-6018-3_8
  4. Leńko-Szymańska, A. (2016). A corpus-based analysis of the development of phraseological competence in EFL learners using the CollGram profile. Paper presented at the 7th Conference of the Formulaic Language Research Network (FLaRN), Vilnius, 28-30 June.
  5. Brants, T., & Franz, A. (2006). Web 1T 5-gram corpus version 1.1. Google Inc.
  6. Lin, D., Church, K., Ji, H., Sekine, S., Yarowsky, D., Bergsma, S., Patil, K., Pitler, E., Lathbury, R., Rao, V., Dalwani, K., & Narsale, S. (2010). Final report of the 2009 JHU CLSP workshop.
  7. Bergsma, S., Pitler, E., & Lin, D. (2010, July). Creating robust supervised classifiers via web-scale N-gram data. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 865-874). Association for Computational Linguistics.
  8. Lin, D. (2013). Personal communication, October.
  9. Wang, K., Thrasher, C., Viegas, E., Li, X., & Hsu, B. J. P. (2010, June). An overview of Microsoft Web N-gram corpus and applications. In Proceedings of the NAACL HLT 2010 Demonstration Session (pp. 45-48). Association for Computational Linguistics.
  10. Swan, O. E. (2003). Polish Grammar in a Nutshell. University of Pittsburgh.
  11. Choong, C., & Power, M. S. The Difference between Written and Spoken English. Assignment Unit, 1.
  12. Daniels, P. T., & Bright, W. (1996). The world's writing systems. Oxford University Press.
  13. Coleman, J. (2014). A speech is not an essay. Harvard Business Review.
  14. Ager, S. (2013). Differences between writing and speech, Omniglot—the online encyclopedia of writing systems and languages.
  15. Language detection library ported from Google's language-detection. https://pypi.python.org/pypi/langdetect
  16. Maziarz, M., Piasecki, M., & Szpakowicz, S. (2012). Approaching plWordNet 2.0. In Proceedings of 6th International Global Wordnet Conference, The Global WordNet Association (pp. 189-196).
  17. Wołk, K., & Marasek, K. (2014). Polish-English Speech Statistical Machine Translation Systems for the IWSLT 2014. In Proceedings of the 11th International Workshop on Spoken Language Translation (pp. 143-149). Lake Tahoe, USA.
  18. Bergsma, S., Pitler, E., & Lin, D. (2010, July). Creating robust supervised classifiers via web-scale N-gram data. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 865-874). Association for Computational Linguistics.
  19. Bojar, O., Buck, C., Callison-Burch, C., Federmann, C., Haddow, B., Koehn, P., Monz, C., Post, M., Soricut, R., & Specia, L. (2013). Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation (pp. 1-44). Sofia, Bulgaria: Association for Computational Linguistics.
  20. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., ... & Dyer, C. (2007, June). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions (pp. 177-180). Association for Computational Linguistics.
  21. Chen, S. F., & Goodman, J. (1996, June). An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th annual meeting on Association for Computational Linguistics (pp. 310-318). Association for Computational Linguistics, http://dx.doi.org/10.3115/981863.981904
  22. Perplexity [Online]. Hidden Markov Model Toolkit website. Cambridge University Engineering Dept. Available: http://www1.icsi.berkeley.edu/Speech/docs/HTKBook3.2/node188_mn.html, retrieved on November 29, 2015.
  23. Koehn, P. (2010). Moses: Statistical machine translation system, user manual and code guide.
  24. Jurafsky, D. Language modeling: Introduction to n-grams [Online]. Stanford University. Available: https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf, retrieved on November 29, 2015.
  25. Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002, July). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics (pp. 311-318). Association for Computational Linguistics.
  26. Axelrod, A. (2006). Factored language models for statistical machine translation. http://dx.doi.org/10.1007/s10590-010-9082-5
  27. Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5), 602-610.
  28. Stolcke, A. (2002, September). SRILM: An extensible language modeling toolkit. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP 2002) (Vol. 2, pp. 901-904).
  29. Junczys-Dowmunt, M., & Szał, A. (2012). SyMGiza++: Symmetrized word alignment models for statistical machine translation. In Security and Intelligent Information Systems (pp. 379-390). Springer Berlin Heidelberg, http://dx.doi.org/10.1007/978-3-642-25261-7_30
  30. Durrani, N., Sajjad, H., Hoang, H., & Koehn, P. (2014, April). Integrating an Unsupervised Transliteration Model into Statistical Machine Translation. In EACL (Vol. 14, pp. 148-153), http://dx.doi.org/10.3115/v1/E14-4029
  31. Wołk, K., & Marasek, K. (2014). Real-time statistical speech translation. In New Perspectives in Information Systems and Technologies, Volume 1 (pp. 107-113). Springer International Publishing, http://dx.doi.org/10.1007/978-3-319-05951-8_11
  32. Morfeusz Tagger [Online]. Available: http://sgjp.pl/morfeusz/morfeusz.html, retrieved on March 23, 2017.