Spoken Language Corpora Augmentation with Domain-Specific Voice-Cloned Speech
Mateusz Czyżnikiewicz, Łukasz Bondaruk, Jakub Kubiak, Adam Wiącek, Łukasz Degórski, Marek Kubis, Paweł Skórzewski
DOI: http://dx.doi.org/10.15439/2024F9637
Citation: Proceedings of the 19th Conference on Computer Science and Intelligence Systems (FedCSIS), M. Bolanowski, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 39, pages 579–584 (2024)
Abstract. In this paper we study the impact of augmenting spoken language corpora with domain-specific synthetic samples for the purpose of training a speech recognition system. Using both a conventional neural text-to-speech system and a zero-shot one with voice-cloning ability, we generate speech corpora that vary in the number of voices. We compare speech recognition models trained with the addition of different amounts of synthetic data generated by these two methods against a baseline model trained solely on voice recordings. We show that although the voice-cloned dataset is of lower quality, its greater voice diversity makes it much more effective than the dataset containing only a few voices synthesized with a conventional neural text-to-speech system. Furthermore, our experiments indicate that using low-variability synthetic speech quickly saturates ASR quality, whereas high-variability speech keeps providing improvements even when the total amount of training data is increased by 30%.
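The augmentation pipeline summarized in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `tts.clone_and_say` call, file layout, and JSONL manifest format are assumptions made for the example, and `synth_ratio` stands for the varied proportions of synthetic data (e.g. up to an extra 30% of the baseline corpus).

```python
# Minimal sketch: synthesize domain-specific sentences with a voice-cloning TTS
# model (many voices for diversity) and mix them with real recordings into a
# single ASR training manifest. All interfaces below are illustrative.
import json
import random


def synthesize_corpus(sentences, reference_voices, tts, out_dir):
    """Generate one synthetic utterance per sentence, cycling over many
    cloned voices to increase the voice diversity of the augmented corpus."""
    entries = []
    for i, text in enumerate(sentences):
        voice = random.choice(reference_voices)       # reference audio for cloning
        wav_path = f"{out_dir}/synth_{i:06d}.wav"
        tts.clone_and_say(text, voice, wav_path)      # hypothetical TTS call
        entries.append({"audio": wav_path, "text": text, "source": "synthetic"})
    return entries


def build_training_manifest(real_entries, synth_entries, synth_ratio, path):
    """Mix real recordings with a chosen proportion of synthetic utterances
    and write a JSONL manifest for ASR training."""
    n_synth = min(int(len(real_entries) * synth_ratio), len(synth_entries))
    mixed = real_entries + random.sample(synth_entries, n_synth)
    random.shuffle(mixed)
    with open(path, "w", encoding="utf-8") as f:
        for entry in mixed:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

Varying `synth_ratio` and the size of `reference_voices` corresponds to the two axes studied in the paper: the amount of added synthetic data and the number of distinct voices it contains.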