Czech parliament meeting recordings as ASR training data
Jan Oldřich Krůza
DOI: http://dx.doi.org/10.15439/2020F119
Citation: Proceedings of the 2020 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 21, pages 185–188 (2020)
Abstract. I present a way to use the stenographed recordings of Czech parliament meetings to train a speech-to-text system. The article describes a method for scraping the data, obtaining word-level alignment, and selecting the reliable parts of the imprecise transcript. Finally, I present an ASR system trained on these and other data.
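To make the idea of selecting reliable parts of an imprecise transcript more concrete, here is a minimal sketch. It is not the method from the article, which the abstract does not detail. It assumes one common approach: compare the stenographic transcript with the output of an existing recognizer and keep only the runs of words where both agree. The function name `reliable_spans` and the `min_run` threshold are hypothetical, introduced only for illustration.

```python
from difflib import SequenceMatcher


def reliable_spans(transcript_words, asr_words, min_run=3):
    """Return (start, end) index pairs into transcript_words.

    Each pair marks a run of at least `min_run` consecutive words on which
    the transcript and the recognizer hypothesis agree exactly.
    Assumption: exact agreement is used as a proxy for transcript reliability.
    """
    matcher = SequenceMatcher(a=transcript_words, b=asr_words, autojunk=False)
    spans = []
    for block in matcher.get_matching_blocks():
        # The final block returned by difflib is a zero-length sentinel,
        # which the size check below filters out.
        if block.size >= min_run:
            spans.append((block.a, block.a + block.size))
    return spans


if __name__ == "__main__":
    transcript = ["a", "b", "c", "x", "d", "e", "f"]
    hypothesis = ["a", "b", "c", "y", "d", "e", "f"]
    print(reliable_spans(transcript, hypothesis))
```

A larger `min_run` keeps fewer but more trustworthy segments. In a real pipeline the kept spans would still need word-level timestamps from the alignment step before they could be cut into training utterances.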