Open Vocabulary Keyword Spotting with Small-Footprint ASR-based Architecture and Language Models

DOI: http://dx.doi.org/10.15439/2023F8594

Abstract. We present the results of experiments on minimizing the model size for the text-based Open Vocabulary Keyword Spotting task. The main goal is to perform inference on devices with limited computing power, such as mobile phones. Our solution is based on an acoustic model architecture adopted from the automatic speech recognition task. We extend the acoustic model with a simple yet powerful language model, which improves recognition results without affecting latency or memory footprint. We also present a method for improving the recognition rate of rare keywords, based on recordings generated by a text-to-speech system. Evaluations on a public test set show that our solution achieves a true positive rate between 73% and 86%, with a false positive rate below 24%. The model size is only 3.2 MB, and the real-time factor measured on contemporary mobile phones is 0.05.
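
For readers unfamiliar with the reported metrics, the following minimal Python sketch shows how they are conventionally computed. It assumes the standard definitions (real-time factor as processing time divided by audio duration, and true/false positive rates derived from confusion counts); the exact evaluation protocol may differ. The function names and all numbers in the usage lines are hypothetical illustrations, not measurements from this work.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Conventional RTF: time spent processing divided by audio duration.

    Values below 1.0 mean the system runs faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def detection_rates(tp: int, fn: int, fp: int, tn: int) -> tuple[float, float]:
    """Return (true positive rate, false positive rate) from confusion counts.

    Assumes the standard definitions TPR = TP / (TP + FN) and
    FPR = FP / (FP + TN).
    """
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one positive and one negative example")
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical numbers chosen only to illustrate the arithmetic.
rtf = real_time_factor(processing_seconds=0.5, audio_seconds=10.0)
tpr, fpr = detection_rates(tp=80, fn=20, fp=10, tn=90)
print(f"RTF={rtf:.2f} TPR={tpr:.2f} FPR={fpr:.2f}")
```

Under this definition, a real-time factor of 0.05 corresponds to processing ten seconds of audio in roughly half a second.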


