
Proceedings of the 18th Conference on Computer Science and Intelligence Systems

Annals of Computer Science and Information Systems, Volume 35

Gender-aware speaker's emotion recognition based on 1-D and 2-D features


DOI: http://dx.doi.org/10.15439/2023F4485

Citation: Proceedings of the 18th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 35, pages 1011–1015 (2023)


Abstract. An approach to speaker's emotion recognition based on several acoustic feature types and 1-D convolutional neural networks is described. The focus is on selecting the best speech features, improving the baseline model configuration, and integrating a gender classification network into the solution. The features include a Mel-scale spectrogram as well as MFCC-, Chroma-, prosodic- and pitch-related features. In particular, the question of whether to use 2-D feature maps or to reduce them to 1-D vectors by averaging is resolved experimentally. The well-known speech datasets RAVDESS, TESS, CREMA-D and SAVEE are used in the experiments. The best-performing model turned out to consist of two convolutional networks for gender-aware classification and one gender classifier. The Chroma features were found to be redundant, and even detrimental, when combined with the other speech features. The F1 score of the proposed solution reached 73.2% on the RAVDESS dataset and 66.5% on all four datasets combined, improving on the baseline model by 7.8% and 3%, respectively. This approach is an alternative to other proposed models, which reported accuracy scores of 60%-71% on the RAVDESS dataset.
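To illustrate the feature types named in the abstract and the 2-D-maps-versus-1-D-vectors question, the sketch below extracts a Mel-scale spectrogram, MFCC, Chroma, prosodic (zero-crossing rate, RMS energy) and pitch features with librosa, and shows how the 2-D time-frequency maps can be collapsed into a single 1-D vector by averaging over time. This is a minimal sketch, not the authors' actual pipeline: the choice of librosa, the frame parameters, and the feature sizes (128 Mel bands, 40 MFCCs) are illustrative assumptions.

```python
# Minimal, illustrative feature-extraction sketch (not the paper's exact pipeline).
import numpy as np
import librosa

def extract_features(path, n_mels=128, n_mfcc=40):
    y, sr = librosa.load(path, sr=22050)

    # 2-D feature maps, each of shape (n_features, n_frames)
    mel    = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
    mfcc   = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)           # reported as redundant in the paper
    zcr    = librosa.feature.zero_crossing_rate(y)             # prosody-related
    rms    = librosa.feature.rms(y=y)                          # prosody-related (energy)
    f0     = librosa.yin(y, fmin=65, fmax=2093, sr=sr)[np.newaxis, :]  # pitch contour

    maps_2d = [mel, mfcc, chroma, zcr, rms, f0]

    # Option A: keep the 2-D maps (pad/crop to a common frame count for a 2-D CNN).
    # Option B: average each map over the time axis and concatenate the means,
    #           yielding one 1-D vector per utterance for a 1-D convolutional network.
    vec_1d = np.concatenate([m.mean(axis=1) for m in maps_2d])
    return maps_2d, vec_1d
```

In this sketch the gender-aware part of the model would simply route the same features either to a "male" or a "female" emotion classifier according to the output of a separate gender classifier, mirroring the two-network arrangement described in the abstract.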
