Urban Sound Classification using Long Short-Term Memory Neural Network
Iurii Lezhenin, Natalia Bogach, Evgeny Pyshkin
DOI: http://dx.doi.org/10.15439/2019F185
Citation: Proceedings of the 2019 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 18, pages 57–60 (2019)
Abstract. Environmental sound classification has received increasing attention in recent years. Analysis of environmental sounds is difficult because of their unstructured nature. However, the presence of strong spectro-temporal patterns makes classification possible. Since LSTM neural networks are efficient at learning temporal dependencies, we propose and examine an LSTM model for urban sound classification. The model is trained on magnitude mel-spectrograms extracted from the audio of the UrbanSound8K dataset. The proposed network is evaluated using 5-fold cross-validation and compared with a baseline CNN. It is shown that the LSTM model outperforms a set of existing solutions and is more accurate and confident than the CNN.
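To make the pipeline summarized in the abstract concrete, the sketch below shows how magnitude mel-spectrograms could be extracted and fed to a small stacked-LSTM classifier. It is a minimal sketch only, assuming librosa for feature extraction and PyTorch for the model; the library choices and all hyperparameters (sampling rate, 128 mel bands, two LSTM layers, 64 hidden units) are illustrative assumptions and are not taken from the paper.

```python
# Illustrative sketch: librosa for features, PyTorch for the classifier.
# Hyperparameters below are assumptions, not the authors' configuration.
import librosa
import numpy as np
import torch
import torch.nn as nn

def magnitude_mel_spectrogram(path, sr=22050, n_mels=128):
    """Load one clip and return a (time, n_mels) magnitude mel-spectrogram."""
    y, _ = librosa.load(path, sr=sr)
    # power=1.0 yields magnitude (rather than power) mel bins.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, power=1.0)
    # Transpose so that time frames form the LSTM sequence axis.
    return mel.T.astype(np.float32)

class LSTMClassifier(nn.Module):
    """Stacked LSTM over mel frames; the last hidden state feeds a linear output layer."""
    def __init__(self, n_mels=128, hidden=64, n_layers=2, n_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=n_layers, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):            # x: (batch, time, n_mels)
        _, (h, _) = self.lstm(x)     # h: (n_layers, batch, hidden)
        return self.fc(h[-1])        # logits for the 10 UrbanSound8K classes

# Hypothetical usage on a single clip (the file path is a placeholder):
# feats = torch.from_numpy(magnitude_mel_spectrogram("fold1/some_clip.wav"))
# logits = LSTMClassifier()(feats.unsqueeze(0))   # add batch dimension
```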
References
- R. Radhakrishnan, A. Divakaran, and A. Smaragdis, “Audio analysis for surveillance applications,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005. IEEE, 2005, pp. 158–161. [Online]. Available: https://doi.org/10.1109/ASPAA.2005.1540194
- M. Cristani, M. Bicego, and V. Murino, “Audio-visual event recognition in surveillance video sequences,” IEEE Transactions on Multimedia, vol. 9, no. 2, pp. 257–267, 2007. [Online]. Available: https://doi.org/10.1109/TMM.2006.886263
- S. Chu, S. Narayanan, C.-C. J. Kuo, and M. J. Mataric, “Where am I? Scene recognition for mobile robots using audio features,” in 2006 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2006, pp. 885–888. [Online]. Available: https://doi.org/10.1109/ICME.2006.262661
- R. Bardeli, D. Wolff, F. Kurth, M. Koch, K.-H. Tauchert, and K.-H. Frommolt, “Detecting bird sounds in a complex acoustic environment and application to bioacoustic monitoring,” Pattern Recognition Letters, vol. 31, no. 12, pp. 1524–1534, 2010. [Online]. Available: https://doi.org/10.1016/j.patrec.2009.09.014
- C. Mydlarz, J. Salamon, and J. P. Bello, “The implementation of low-cost urban acoustic monitoring devices,” Applied Acoustics, vol. 117, pp. 207–218, 2017. [Online]. Available: https://doi.org/10.1016/j.apacoust.2016.06.010
- D. Steele, J. Krijnders, and C. Guastavino, “The sensor city initiative: cognitive sensors for soundscape transformations,” GIS Ostrava, pp. 1–8, 2013.
- V. Davidovski, “Exponential innovation through digital transformation,” in Proceedings of the 3rd International Conference on Applications in Information Technology. ACM, 2018, pp. 3–5. [Online]. Available: https://doi.org/10.1145/3274856.3274858
- F. Tappero, R. M. Alsina-Pagès, L. Duboc, and F. Alías, “Leveraging urban sounds: A commodity multi-microphone hardware approach for sound recognition,” in Multidisciplinary Digital Publishing Institute Proceedings, vol. 4, no. 1, 2019, p. 55. [Online]. Available: https://doi.org/10.3390/ecsa-5-05756
- E. Pyshkin, “Designing human-centric applications: Transdisciplinary connections with examples,” in 2017 3rd IEEE International Conference on Cybernetics (CYBCONF). IEEE, 2017, pp. 1–6. [Online]. Available: https://doi.org/10.1109/CYBConf.2017.7985774
- E. Pyshkin and A. Kuznetsov, “Approaches for web search user interfaces: How to improve the search quality for various types of information,” JoC, vol. 1, no. 1, pp. 1–8, 2010. [Online]. Available: https://www.earticle.net/Article/A188181
- M. B. Dias, “Navpal: Technology solutions for enhancing urban navigation for blind travelers,” tech. report CMU-RI-TR-21, Robotics Institute, Carnegie Mellon University, 2014.
- S. Chu, S. Narayanan, and C.-C. J. Kuo, “Environmental sound recognition with time–frequency audio features,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 6, pp. 1142–1158, 2009. [Online]. Available: https://doi.org/10.1109/TASL.2009.2017438
- S. Chachada and C.-C. J. Kuo, “Environmental sound recognition: A survey,” in 2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). IEEE, Oct. 2013, pp. 1–9. [Online]. Available: https://doi.org/10.1109/APSIPA.2013.6694338
- D. Giannoulis, E. Benetos, D. Stowell, M. Rossignol, M. Lagrange, and M. Plumbley, “Detection and classification of acoustic scenes and events: An IEEE AASP challenge,” in 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, Oct. 2013, pp. 1–4. [Online]. Available: https://doi.org/10.1109/WASPAA.2013.6701819
- Z. Kons, O. Toledo-Ronen, and M. Carmel, “Audio event classification using deep neural networks,” in Interspeech, 2013, pp. 1482–1486.
- K. J. Piczak, “Environmental sound classification with convolutional neural networks,” in 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2015, pp. 1–6. [Online]. Available: https://doi.org/10.1109/MLSP.2015.7324337
- J. Salamon and J. P. Bello, “Deep convolutional neural networks and data augmentation for environmental sound classification,” IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279–283, 2017. [Online]. Available: https://doi.org/10.1109/LSP.2017.2657381
- V. Boddapati, A. Petef, J. Rasmusson, and L. Lundberg, “Classifying environmental sounds using image recognition networks,” Procedia Computer Science, vol. 112, pp. 2048–2056, 2017. [Online]. Available: https://doi.org/10.1016/j.procs.2017.08.250
- B. Zhu, K. Xu, D. Wang, L. Zhang, B. Li, and Y. Peng, “Environmental sound classification based on multi-temporal resolution convolutional neural network combining with multi-level features,” in Pacific Rim Conference on Multimedia. Springer, 2018, pp. 528–537. [Online]. Available: https://doi.org/10.1007/978-3-030-00767-6_49
- Y. Wang, L. Neves, and F. Metze, “Audio-based multimedia event detection using deep recurrent neural networks,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 2742–2746. [Online]. Available: https://doi.org/10.1109/ICASSP.2016.7472176
- S. H. Bae, I. Choi, and N. S. Kim, “Acoustic scene classification using parallel combination of LSTM and CNN,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), 2016, pp. 11–15.
- J. Sang, S. Park, and J. Lee, “Convolutional recurrent neural networks for urban sound classification using raw waveforms,” in 2018 26th European Signal Processing Conference (EUSIPCO). IEEE, 2018, pp. 2444–2448. [Online]. Available: https://doi.org/10.23919/EUSIPCO.2018.8553247
- A. Graves, S. Fernández, and J. Schmidhuber, “Bidirectional LSTM networks for improved phoneme classification and recognition,” in International Conference on Artificial Neural Networks. Springer, 2005, pp. 799–804. [Online]. Available: https://doi.org/10.1007/11550907_126
- A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 6645–6649. [Online]. Available: https://doi.org/10.1109/ICASSP.2013.6638947
- Y. Fan, Y. Qian, F.-L. Xie, and F. K. Soong, “TTS synthesis with bidirectional LSTM based recurrent neural networks,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
- J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond short snippets: Deep networks for video classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4694–4702. [Online]. Available: https://doi.org/10.1109/CVPR.2015.7299101
- J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” in Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 2014, pp. 1041–1044. [Online]. Available: https://doi.org/10.1145/2647868.2655045
- J. Salamon and J. P. Bello, “Unsupervised feature learning for urban sound classification,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 171–175. [Online]. Available: https://doi.org/10.1109/ICASSP.2015.7177954