Polish Information Processing Society

Annals of Computer Science and Information Systems, Volume 18

Proceedings of the 2019 Federated Conference on Computer Science and Information Systems

Urban Sound Classification using Long Short-Term Memory Neural Network


DOI: http://dx.doi.org/10.15439/2019F185

Citation: Proceedings of the 2019 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 18, pages 57–60 (2019)


Abstract. Environmental sound classification has received increasing attention in recent years. Environmental sounds are difficult to analyze because of their unstructured nature. However, the presence of strong spectro-temporal patterns makes classification possible. Since LSTM neural networks are efficient at learning temporal dependencies, we propose and examine an LSTM model for urban sound classification. The model is trained on magnitude mel-spectrograms extracted from the audio of the UrbanSound8K dataset. The proposed network is evaluated using 5-fold cross-validation and compared with a baseline CNN. The results show that the LSTM model outperforms a set of existing solutions and is both more accurate and more confident than the CNN.
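The 5-fold cross-validation protocol mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the helper names `k_fold_splits` and `cross_validate`, the random shuffling of samples into folds, and the toy majority-label classifier are all assumptions made for illustration. The abstract does not say how clips were assigned to folds, and a real experiment would plug an LSTM or CNN into the `fit` and `predict` callables.

```python
import random
from statistics import mean


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k disjoint folds (assumes n_samples >= k)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumption: random fold assignment
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_validate(fit, predict, features, labels, k=5, seed=0):
    """Return per-fold accuracies for a classifier given as fit/predict callables."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(labels), k, seed):
        model = fit([features[j] for j in train_idx],
                    [labels[j] for j in train_idx])
        hits = sum(predict(model, features[j]) == labels[j] for j in test_idx)
        scores.append(hits / len(test_idx))
    return scores


# Hypothetical stand-in for a real model: always predict the most
# frequent label seen during training.
def fit_majority(train_features, train_labels):
    return max(sorted(set(train_labels)), key=train_labels.count)


def predict_majority(model, feature):
    return model


if __name__ == "__main__":
    # Placeholder data: 20 dummy "clips" with two class labels.
    features = list(range(20))
    labels = ["siren"] * 14 + ["dog_bark"] * 6
    fold_scores = cross_validate(fit_majority, predict_majority, features, labels)
    print("per-fold accuracy:", fold_scores)
    print("mean accuracy:", mean(fold_scores))
```

Each sample appears in exactly one test fold, so the mean of the per-fold accuracies estimates performance on unseen clips. UrbanSound8K is distributed with predefined folds, so a faithful reproduction would likely build its splits from those rather than from the random shuffle assumed here.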

