
Proceedings of the 17th Conference on Computer Science and Intelligence Systems

Annals of Computer Science and Information Systems, Volume 30

Tag and correct: high precision post-editing approach to speech recognition errors correction

DOI: http://dx.doi.org/10.15439/2022F168

Citation: Proceedings of the 17th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 30, pages 939–942 (2022)


Abstract. This paper presents a new approach to the problem of correcting speech recognition errors by means of post-editing. It consists of a neural sequence tagger that learns how to correct the ASR (Automatic Speech Recognition) hypothesis word by word, and a corrector module that applies the corrections returned by the tagger. The proposed solution is applicable to any ASR system, regardless of its architecture, and provides high-precision control over the errors being corrected. This is especially crucial in production environments, where avoiding the introduction of new mistakes by the error correction model may be more important than the net gain in overall results. The results show that the performance of the proposed error correction models is comparable with previous approaches, while requiring far fewer resources to train, which makes the approach suitable for industrial applications, where both inference latency and training time are critical factors limiting the use of other techniques.
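The sketch below illustrates, in simplified form, how a corrector module of this kind can apply word-level tags produced by a sequence tagger to an ASR hypothesis. The tag names (KEEP, DELETE, REPLACE_*, APPEND_*) and the example sentence are assumptions made here for illustration; the paper's actual tag inventory, tagger architecture, and correction logic may differ.

```python
# Minimal sketch of the "tag and correct" post-editing idea (illustrative only).
# A trained sequence tagger would predict one correction tag per hypothesis word;
# this module then applies those tags to produce the corrected transcript.

from typing import List


def apply_corrections(hypothesis: List[str], tags: List[str]) -> List[str]:
    """Apply word-level correction tags to an ASR hypothesis."""
    corrected = []
    for word, tag in zip(hypothesis, tags):
        if tag == "KEEP":                       # word assumed correct, keep as-is
            corrected.append(word)
        elif tag == "DELETE":                   # drop a spuriously inserted word
            continue
        elif tag.startswith("REPLACE_"):        # substitute a misrecognized word
            corrected.append(tag[len("REPLACE_"):])
        elif tag.startswith("APPEND_"):         # keep word, then insert a missing one
            corrected.append(word)
            corrected.append(tag[len("APPEND_"):])
        else:                                   # unknown tag: fall back to the input
            corrected.append(word)
    return corrected


if __name__ == "__main__":
    # Hypothetical ASR output and tagger prediction
    hyp = ["the", "cat", "sad", "on", "the", "mat"]
    tags = ["KEEP", "KEEP", "REPLACE_sat", "KEEP", "KEEP", "KEEP"]
    print(" ".join(apply_corrections(hyp, tags)))  # -> "the cat sat on the mat"
```

Because most tags are simply KEEP, a high-precision tagger only touches words it is confident about, which matches the paper's emphasis on not introducing new mistakes in production settings.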
