Subvocal Speech Recognition via Close-Talk Microphone and Surface Electromyogram Using Deep Learning
Mohamed S. Elmahdy, Ahmed Morsy
DOI: http://dx.doi.org/10.15439/2017F153
Citation: Proceedings of the 2017 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 11, pages 165–168 (2017)
Abstract. Speech is essential for human-human communication and human-machine interaction. Current Automatic Speech Recognition (ASR) systems may not be suitable for quiet settings, such as libraries and meetings, or for speech-impaired and elderly users. In this study, we present an end-to-end deep learning system for subvocal speech recognition. The proposed system uses a single-channel surface electromyogram (sEMG) placed diagonally across the throat alongside a close-talk microphone. The system was tested on a corpus of 20 words. It learns the mapping from sound and sEMG sequences to letters and then extracts the most probable word formed by those letters. We investigated different input signals and different network depths for the deep learning model. The proposed system achieved a Word Error Rate (WER) of 9.44, 8.44, and 9.22 for speech alone, speech combined with a single sEMG channel, and speech combined with two sEMG channels, respectively.
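The abstract describes two steps: a network that maps feature-frame sequences (audio and/or sEMG) to letters, and a decoder that extracts the most probable word from those letters. The paper's architecture is not given here, so the sketch below is only illustrative: it assumes a bidirectional LSTM trained with CTC loss, a common choice for end-to-end letter-level recognition, and snaps the greedily decoded letter string to the closest word in a small vocabulary as a stand-in for the word-extraction step. `SeqToLetters`, `greedy_decode`, the 39-dimensional input features, and the placeholder vocabulary are all assumptions, not details from the paper.

```python
import difflib
import torch
import torch.nn as nn

# Letter alphabet with the CTC blank at index 0 (an assumption; the paper
# only states that sequences are mapped to letters).
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
NUM_CLASSES = len(ALPHABET) + 1  # +1 for the CTC blank

class SeqToLetters(nn.Module):
    """Maps a feature sequence (audio and/or sEMG frames) to per-frame
    letter log-probabilities. num_layers stands in for the "depth levels"
    the abstract says were varied."""
    def __init__(self, n_features, hidden=128, num_layers=2):
        super().__init__()
        self.rnn = nn.LSTM(n_features, hidden, num_layers=num_layers,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, NUM_CLASSES)

    def forward(self, x):                  # x: (batch, time, n_features)
        h, _ = self.rnn(x)
        return self.fc(h).log_softmax(-1)  # (batch, time, NUM_CLASSES)

def greedy_decode(log_probs, vocab):
    """Collapse repeated labels, drop blanks, then snap the letter string
    to the closest vocabulary word (a stand-in for word extraction)."""
    letters, prev = [], 0
    for i in log_probs.argmax(-1).tolist():  # log_probs: (time, classes)
        if i != prev and i != 0:
            letters.append(ALPHABET[i - 1])
        prev = i
    raw = "".join(letters)
    match = difflib.get_close_matches(raw, vocab, n=1)
    return match[0] if match else raw

# Training step on a dummy batch: 4 utterances, 100 frames, 39 features
# (e.g., audio features concatenated with sEMG features -- an assumption).
model = SeqToLetters(n_features=39)
ctc = nn.CTCLoss(blank=0)                  # expects (time, batch, classes)
x = torch.randn(4, 100, 39)
targets = torch.randint(1, NUM_CLASSES, (4, 8))
log_probs = model(x)
loss = ctc(log_probs.transpose(0, 1), targets,
           input_lengths=torch.full((4,), 100, dtype=torch.long),
           target_lengths=torch.full((4,), 8, dtype=torch.long))
loss.backward()

word = greedy_decode(log_probs[0], vocab=["begin", "stop"])  # placeholder vocab
```

Varying `num_layers` and the choice of input features (audio only, audio plus one sEMG channel, audio plus two sEMG channels) would mirror the comparisons the abstract reports.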