Czech parliament meeting recordings as ASR training data

Jan Oldřich Krůza

DOI: http://dx.doi.org/10.15439/2020F119

Citation: Proceedings of the 2020 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 21, pages 185–188 (2020)

Abstract. I present a way to use recordings of Czech parliament meetings, together with their stenographic transcripts, to train a speech-to-text system. The article describes a method for scraping the data, obtaining word-level alignment, and selecting the reliable parts of the imprecise transcripts. Finally, I present an ASR system trained on these and other data.
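
The abstract does not say how the reliable parts of the transcript are chosen. As a rough illustration of the general idea only, and not the method used in the article, one could compare the stenographic transcript with the output of an existing recognizer and keep only runs of words on which both agree. The function name `reliable_spans`, the `min_run` threshold, and the sample sentences below are all hypothetical.

```python
from difflib import SequenceMatcher


def reliable_spans(transcript_words, asr_words, min_run=3):
    """Return (start, end) index pairs into transcript_words.

    Each pair marks a run of at least min_run consecutive words that
    also occurs, in the same order, in asr_words. This is an assumed
    selection rule for illustration, not the article's criterion.
    """
    matcher = SequenceMatcher(a=transcript_words, b=asr_words, autojunk=False)
    spans = []
    for block in matcher.get_matching_blocks():
        # The final block is a zero-length sentinel; the size check skips it.
        if block.size >= min_run:
            spans.append((block.a, block.a + block.size))
    return spans


# Hypothetical example: the recognizer missed one word of the transcript.
transcript = "vážené paní poslankyně vážení páni poslanci zahajuji schůzi".split()
hypothesis = "vážené paní poslankyně páni poslanci zahajuji schůzi".split()

spans = reliable_spans(transcript, hypothesis)
for start, end in spans:
    print(" ".join(transcript[start:end]))
```

Raising `min_run` trades training-data volume for confidence that the kept text really matches the audio; a real pipeline would also carry the word-level timestamps along so that the corresponding audio can be cut out.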
