Acoustic Model Training, using Kaldi, for Automatic Whispery Speech Recognition

Piotr Kozierski; Talar Sadalla; Szymon Drgas; Adam Dąbrowski; Joanna Ziętkiewicz; Wojciech Giernacki

DOI: http://dx.doi.org/10.15439/2018F255

Citation: Position Papers of the 2018 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 16, pages 109–114 (2018)

Abstract. The article presents research on automatic whispery speech recognition. The main task was to find the dependences between the number of triphone classes (the number of leaves in the decision tree) and the total number of Gaussian distributions, and thereby to determine the values for which the quality of speech recognition is best. Moreover, it was determined how these dependences differ between normal and whispery speech, which had not been done before and is the innovative part of this work. Based on the experiments performed and the results obtained, the number of triphone classes (number of leaves) for whispered speech should be significantly lower than for normal speech.
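The search over the two parameters named in the abstract can be pictured as a grid of triphone training runs, one per (number of leaves, total number of Gaussians) pair. The sketch below only builds the command lines for such a grid. It assumes Kaldi's standard `steps/train_deltas.sh` recipe script, which takes the two values as its first positional arguments; the abstract does not say which training script was used, and the grid values and directory names here are hypothetical, chosen only for illustration.

```python
from itertools import product


def build_train_commands(leaf_counts, gauss_totals, data_dir, lang_dir, ali_dir, exp_root):
    """Return one triphone-training command per (leaves, Gaussians) pair."""
    commands = []
    for num_leaves, tot_gauss in product(leaf_counts, gauss_totals):
        # Each leaf holds a GMM with at least one component, so a total
        # below the number of leaves is not a meaningful setting.
        if tot_gauss < num_leaves:
            continue
        # Hypothetical naming scheme for the experiment directories.
        exp_dir = f"{exp_root}/tri_{num_leaves}_{tot_gauss}"
        commands.append(
            f"steps/train_deltas.sh {num_leaves} {tot_gauss} "
            f"{data_dir} {lang_dir} {ali_dir} {exp_dir}"
        )
    return commands


# Hypothetical grid and paths, not the values used in the paper.
cmds = build_train_commands(
    leaf_counts=[500, 1000, 2000],
    gauss_totals=[1000, 4000, 16000],
    data_dir="data/train",
    lang_dir="data/lang",
    ali_dir="exp/mono_ali",
    exp_root="exp",
)
for cmd in cmds:
    print(cmd)
```

Running the same grid separately on normal and whispered training data, and decoding a test set with each resulting model, would give the kind of comparison the abstract describes. Note that the number of leaves passed to the script is a target, so the tree actually built may have fewer.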
