Automatic speaker's age classification in the Common Voice database
Adam Nowakowski, Włodzimierz Kasprzak
DOI: http://dx.doi.org/10.15439/2023F2483
Citation: Proceedings of the 18th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 35, pages 1087–1091 (2023)
Abstract. An approach to speaker age classification using deep neural networks is described. Preliminary signal features are extracted, based on mel-frequency cepstral coefficients (MFCC). For gender classification, an MLP network appears to be a satisfactory lightweight solution. For the age modelling and classification problem, two network types, ResNet34 and x-vectors, were tested and compared. The impact of signal-processing parameters and of gender information on classification performance was studied. The neural networks were trained and validated on the large “Common Voice” dataset of English speech recordings.
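As a minimal sketch of the preliminary feature-extraction step described above, the snippet below computes frame-level MFCCs with the LibRosa library listed in the references. The sampling rate, number of coefficients, and window/hop lengths are illustrative assumptions rather than the settings used in the paper, and the downstream MLP, ResNet34, and x-vector classifiers are not shown.

```python
import librosa


def extract_mfcc(path, sr=16000, n_mfcc=20, n_fft=400, hop_length=160):
    """Compute frame-level MFCC features for one recording.

    The concrete values (16 kHz resampling, 20 coefficients, 25 ms
    windows with a 10 ms hop) are illustrative assumptions only.
    """
    # Load the file as a mono signal, resampled to the target rate
    y, _ = librosa.load(path, sr=sr, mono=True)
    # MFCC matrix of shape (n_mfcc, n_frames)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)
    # Per-coefficient mean/variance normalisation over the utterance
    mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) / \
           (mfcc.std(axis=1, keepdims=True) + 1e-8)
    return mfcc.T  # (n_frames, n_mfcc): one feature vector per frame
```

Frame-level vectors of this kind can be pooled into a single utterance-level vector for a lightweight MLP gender classifier, or fed as a time-frequency matrix to a ResNet34 or x-vector network for age classification.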
References
- E. D. Mysak and T. Hanley, “Aging processes in speech: Pitch and duration characteristics”, Journal of Gerontology, vol. 13, 1958, no. 3, pp. 309–313, https://doi.org/10.1093/GERONJ/13.3.309.
- N. Minematsu, M. Sekiguchi and K. Hirose, “Automatic estimation of one’s age with his/her speech based upon acoustic modeling techniques of speakers”, Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), vol. 1, 2002, pp. 137–140, https://doi.org/10.1109/ICASSP.2002.5743673.
- C. Müller, F. Wittig and J. Baus, “Exploiting speech for recognizing elderly users to respond to their special needs”, Interspeech, Proc. 8th Eur. Conf. Speech Commun. Technol., 2003, pp. 1305–1308, https://doi.org/10.21437/Eurospeech.2003-413.
- U. Kamath, J. Liu and J. Whitaker, Deep Learning for NLP and Speech Recognition, Springer Nature Switzerland AG, Cham, 2019, https://doi.org/10.1007/978-3-030-14596-5.
- P. G. Shivakumar, M. Li, V. Dhandhania and S. S. Narayanan, “Simplified and supervised i-vector modeling for speaker age regression”, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4833–4837, https://doi.org/10.1109/ICASSP.2014.6854520.
- M. H. Bahari, M. McLaren, H. Van Hamme and D. A. van Leeuwen, “Speaker age estimation using i-vectors”, Engineering Applications of Artificial Intelligence, vol. 34, 2014, pp. 99–108, https://doi.org/10.1016/j.engappai.2014.05.003.
- N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel and P. Ouellet, “Front-End Factor Analysis for Speaker Verification”, IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, 2011, no. 4, pp. 788–798, https://doi.org/10.1109/TASL.2010.2064307.
- D. Snyder, D. Garcia-Romero, G. Sell, D. Povey and S. Khudanpur, “X-Vectors: Robust DNN Embeddings for Speaker Recognition”, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5329–5333, https://doi.org/10.1109/ICASSP.2018.8461375.
- B. Gu, W. Guo, L. Dai and J. Du, “An Adaptive X-vector Model for Text-independent Speaker Verification”, arXiv preprint arXiv:2002.06049, 2020, https://doi.org/10.48550/arXiv.2002.06049.
- L. Zhou, M. Wang, Y. Qian, H. Luo, H. Li and X. Lin, “Text-independent Speaker Recognition Based on X-vector”, 2022 7th International Conference on Signal and Image Processing (ICSIP), 2022, pp. 121–125, https://doi.org/10.1109/ICSIP55141.2022.9887021.
- R. Zazo, P. Sankar Nidadavolu, N. Chen, J. Gonzalez-Rodriguez and N. Dehak, “Age Estimation in Short Speech Utterances Based on LSTM Recurrent Neural Networks”, IEEE Access, vol. 6, pp. 22524–22530, 2018, https://doi.org/10.1109/ACCESS.2018.2816163.
- A. I. Mansour and S. S. Abu-Naser, “Classification of Age and Gender Using Resnet - Deep Learning”, International Journal of Academic Engineering Research (IJAER), vol. 6, 2022, no. 8, pp. 20–29, https://philpapers.org/rec/MANCOA-4/.
- R. Ardila, M. Branson, K. Davis et al., “Common Voice: A Massively-Multilingual Speech Corpus”, Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 2020, pp. 4218–4222, https://aclanthology.org/2020.lrec-1.520/.
- N. Tawara, A. Ogawa, Y. Kitagishi and H. Kamiyama, “Age-VOX-Celeb: Multi-Modal Corpus for Facial and Speech Estimation”, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6963–6967, https://doi.org/10.1109/ICASSP39728.2021.9414272.
- LibRosa, “Audio and music processing in Python”, https://librosa.org/
- C. Li, X. Ma, B. Jiang et al., “Deep Speaker: an End-to-End Neural Speaker Embedding System”, arXiv preprint arXiv:1705.02304, May 2017, https://doi.org/10.48550/arXiv.1705.02304.
- S. Hourri and J. Kharroubi, “A deep learning approach for speaker recognition”, International Journal of Speech Technology, vol. 23, 2020, pp. 123–131, https://doi.org/10.1007/s10772-019-09665-y.