Using formant frequencies to word detection in recorded speech

Łukasz Laszko

DOI: http://dx.doi.org/10.15439/2016F518

Citation: Proceedings of the 2016 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 8, pages 797–801 (2016)

Abstract. The paper considers how to increase the precision of word detection in an unsupervised keyword spotting method. The method examines the signal similarity of two media descriptions: the recorded voice and a word (textual query) synthesized with Text-to-Speech tools. Each description is a sequence of Mel-Frequency Cepstral Coefficients or Human-Factor Cepstral Coefficients. The Dynamic Time Warping algorithm provides the time alignment of the two descriptions. Detection uses a classification method based on a cost function calculated from the signal similarity and the alignment path. Potential false matches are eliminated by a two-stage verification that uses the Longest Common Subsequence algorithm and analyzes the formant frequencies of eleven English monophthongs. Using formant frequencies at the verification stage increased overall detection precision by about 10% compared with the original algorithm.
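The time-alignment step named in the abstract can be illustrated with a minimal Dynamic Time Warping sketch. It is not the paper's implementation: it assumes the cepstral feature vectors (MFCC or HFCC frames) have already been computed, uses Euclidean distance as the local cost (an assumption), and the function name `dtw` is hypothetical. It returns the accumulated alignment cost and the alignment path, the two quantities the abstract says the cost function is calculated from.

```python
import math


def dtw(a, b):
    """Align two non-empty sequences of feature vectors.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    matching frame a[i] to frame b[j]. Euclidean local distance is an
    assumption made for this sketch.
    """
    n, m = len(a), len(b)
    # acc[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            acc[i][j] = d + min(acc[i - 1][j],      # step in a only
                                acc[i][j - 1],      # step in b only
                                acc[i - 1][j - 1])  # step in both

    # Backtrack from the end to recover the alignment path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((acc[i - 1][j - 1], i - 1, j - 1),
                      (acc[i - 1][j], i - 1, j),
                      (acc[i][j - 1], i, j - 1))
    path.reverse()
    return acc[n][m], path


# Toy one-dimensional "features": the second sequence repeats one frame.
query = [(0.0,), (1.0,), (2.0,)]
speech = [(0.0,), (1.0,), (1.0,), (2.0,)]
cost, path = dtw(query, speech)
print(cost, path)
```

In a keyword spotting setting the query sequence would come from the synthesized word and the other sequence from a window of the recording; a low accumulated cost along a near-diagonal path suggests a candidate match that later verification stages can accept or reject.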

References

Ł. Laszko, “Word detection in recorded speech using textual queries”, Proceedings of the 2015 Federated Conference on Computer Science and Information Systems, 2015, pp. 849-853, http://dx.doi.org/10.15439/2015F341
D. von Zeddelmann, F. Kurth, and M. Müller, "Perceptual audio features for unsupervised key-phrase detection," Proc. ICASSP2010, 2010, pp. 257-260, http://dx.doi.org/10.1109/ICASSP.2010.5495974.
S. Tabibian, A. Akbar, B. Nasersharif, "A fast search technique for discriminative keyword spotting," Artificial Intelligence and Signal Processing (AISP), 2012 16th CSI International Symposium on, pp.140-144, 2-3 May 2012, http://dx.doi.org/10.1109/AISP.2012.6313733.
M. Sigmund, “Search for Keywords and Vocal Elements in Audio Recordings”, Elektronika ir elektrotechnika, ISSN 1392-1215, vol. 19, no. 9, pp. 71-74, 2013
V. Mitra, J. van Hout, et al., “Feature Fusion for High-Accuracy Keyword Spotting”, Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 7143-7147, 2014
J. Tejedor, D. T. Toledano, et al., “Spoken term detection ALBAYZIN 2014 evaluation: overview, systems, results, and discussion”, EURASIP Journal on Audio, Speech, and Music Processing (2015) 2015:21, http://dx.doi.org/10.1186/s13636-015-0063-8
A. S. Park and J. R. Glass, “Unsupervised pattern discovery in speech,” IEEE Trans. on Audio, Speech and Language Processing, vol. 16, no. 1, pp. 186–197, 2008.
M. D. Skowronski and J. G. Harris, “Exploiting independent filter bandwidth of human factor cepstral coefficients in automatic speech recognition,” The Journal of the Acoustical Society of America (JASA), vol. 116, no. 3, pp. 1774–1780, 2004.
G. Hunter, H. Kebede, Formant frequencies of British English vowels produced by native speakers of Farsi. Societe Francaise d’Acoustique, Acoustics 2012, Apr 2012, Nantes, France
D. Deterding (1997). The Formants of Monophthong Vowels in Standard Southern British English Pronunciation, Journal of the International Phonetic Association, 27, pp. 47-55
R. Snell, F. Milinazzo, Formant location from LPC analysis data, IEEE Transactions on Speech and Audio Processing. Vol. 1, Number 2, 1993, pp. 129-134.
J. Holmes, W. Holmes, P. Garner, Using formant frequencies in speech recognition, Eurospeech, Vol. 97, pp. 2083-2087, http://www.idiap.ch/~pgarner/pubs/holmes1997.pdf