Speech sound detection employing deep learning

Cezary Polak; Jakub Mańkowski; Wiktor Uciński; Patryk Schramka; Mikołaj Mysiakowski; Adam Kurowski

DOI: http://dx.doi.org/10.15439/2021F146

Citation: Position and Communication Papers of the 16th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 26, pages 221–222 (2021)

Abstract. Speech is the primary means of communication between people, both in everyday conversation and as a speech signal that is transmitted and recorded in numerous ways. The latter is especially important during the global SARS-CoV-2 pandemic, when meeting and talking with people in person is often not possible. Streaming, VoIP calls, and live podcasts are just some of the many applications whose usage has increased significantly because of the need for social distancing. In this paper, we present a method to design, develop, and test a deep learning-based algorithm that performs voice activity detection better than benchmark solutions such as the WebRTC VAD algorithm, an industry standard based mainly on a classic approach to speech signal processing.
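
To make the task concrete, voice activity detection can be viewed as labelling each short frame of audio as speech or non-speech. The sketch below is a deliberately simple energy-threshold detector, shown only to illustrate what a classic signal-processing approach to that labelling might look like. It is neither the WebRTC VAD algorithm nor the deep learning model proposed in the paper; the function names, the 160-sample frame length, the 16 kHz sample rate, and the 0.01 threshold are all assumptions chosen for illustration.

```python
import math


def frame_energies(samples, frame_len):
    """Split samples into non-overlapping frames and return the mean energy of each."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def energy_vad(samples, frame_len=160, threshold=0.01):
    """Return one boolean per frame: True when the mean energy exceeds the threshold.

    frame_len and threshold are illustrative assumptions, not values from the paper.
    """
    return [energy > threshold for energy in frame_energies(samples, frame_len)]


# Synthetic input at an assumed 16 kHz sample rate:
# one 10 ms frame of silence followed by one 10 ms frame of a 440 Hz tone.
sample_rate = 16000
silence = [0.0] * 160
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(160)]

decisions = energy_vad(silence + tone)
print(decisions)
```

A fixed energy threshold like this reacts to any loud sound, not only to speech, which is one reason learned detectors are of interest for noisy recordings.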

References

H. Haneche, B. Boudraa, and A. Ouahabi, “A new way to enhance speech signal based on compressed sensing,” Measurement, vol. 151, p. 107117, 2020. https://doi.org/10.1016/j.measurement.2019.107117. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0263224119309832
K. Paciorek. Andrzej Duda o: LGBT, TVP, koronawirusie, głosach po Bosaku i o szansach w starciu z Trzaskowskim (in Polish). Youtube (Imponderabilia channel). [Online]. Available: https://www.youtube.com/watch?v=Izxj72bg4A4
Freesound. Party Sounds recording from the online royalty free recordings archive. [Online]. Available: https://freesound.org/people/FreqMan/sounds/23153/
B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and music signal analysis in Python,” in Proceedings of the 14th Python in Science Conference, vol. 8, 2015.
GitHub. Python interface to the WebRTC voice activity detector. [Online]. Available: https://github.com/wiseman/py-webrtcvad
M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015. [Online]. Available: https://www.tensorflow.org/