Open Vocabulary Keyword Spotting with Small-Footprint ASR-based Architecture and Language Models

Mikołaj Pudo; Mateusz Wosik; Artur Janicki

DOI: http://dx.doi.org/10.15439/2023F8594

Citation: Proceedings of the 18th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 35, pages 657–666 (2023)

Abstract. We present the results of experiments on minimizing the model size for the text-based Open Vocabulary Keyword Spotting task. The main goal is to perform inference on devices with limited computing power, such as mobile phones. Our solution is based on an acoustic model architecture adopted from the automatic speech recognition task. We extend the acoustic model with a simple yet powerful language model, which improves recognition results without affecting latency or memory footprint. We also present a method, based on recordings generated by a text-to-speech system, for improving the recognition rate of rare keywords. Evaluations on a public test set show that our solution achieves a true positive rate of 73%–86% with a false positive rate below 24%. The model size is only 3.2 MB, and the real-time factor measured on contemporary mobile phones is 0.05.
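
As a reading aid, the metrics quoted in the abstract can be written out using their standard definitions: true positive rate TP / (TP + FN), false positive rate FP / (FP + TN), and real-time factor as processing time divided by audio duration. The Python sketch below is only an illustration of those definitions. The function names and the toy labels are hypothetical, and nothing here reproduces the authors' evaluation pipeline or their data.

```python
def detection_rates(labels, predictions):
    """Return (TPR, FPR) from boolean ground truth and boolean detections.

    labels[i] is True when the keyword is really present in utterance i;
    predictions[i] is True when the spotter reported a detection.
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)          # keyword present, detected
    fn = sum(1 for y, p in pairs if y and not p)      # keyword present, missed
    fp = sum(1 for y, p in pairs if not y and p)      # keyword absent, false alarm
    tn = sum(1 for y, p in pairs if not y and not p)  # keyword absent, no alarm
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; below 1 is faster than real time."""
    return processing_seconds / audio_seconds


# Toy data (made up for illustration only).
toy_labels = [True, True, True, True, False, False, False, False]
toy_predictions = [True, True, True, False, True, False, False, False]
toy_tpr, toy_fpr = detection_rates(toy_labels, toy_predictions)

# A real-time factor of 0.05 corresponds to, e.g., 0.5 s of compute for 10 s of audio.
toy_rtf = real_time_factor(0.5, 10.0)
```

Under these definitions, a real-time factor of 0.05 means that the model needs about one twentieth of the audio's duration to process it.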
