
Position Papers of the 20th Conference on Computer Science and Intelligence Systems

Annals of Computer Science and Information Systems, Volume 44

Enhancing Arabic ASR in Noisy and Transcoding EVS Conditions: A Multimodal Deep Learning Study


DOI: http://dx.doi.org/10.15439/2025F2032

Citation: Position Papers of the 20th Conference on Computer Science and Intelligence Systems, M. Bolanowski, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 44, pages 18


Abstract. In this paper, we investigate the impact of speech transcoding and noise on the performance of Arabic automatic speech recognition (ASR) systems based on deep learning. We apply Non-negative Matrix Factorization (NMF) as a denoising preprocessing step to enhance robustness to noise. Three deep architectures, CNN-LSTM, LSTM, and DNN, are evaluated using fused acoustic features including MFCCs, mel-spectrograms, and Gabor filter representations. Experiments are conducted under four signal-to-noise ratio (SNR) conditions (−5 dB, 0 dB, 5 dB, and 10 dB) on both transcoded and non-transcoded speech. Results show that the CNN-LSTM model achieves the highest accuracy of 87% at 10 dB SNR on clean (non-transcoded) speech using multimodal features. However, recognition performance degrades by 2–4% when the speech is compressed with the Enhanced Voice Services (EVS) codec, especially in high-noise environments. Specifically, transcoding lowers accuracy from 65.00% to 61.43% at −5 dB SNR, and from 87.00% to 84.00% at 10 dB SNR. These findings highlight the negative impact of mobile codec compression on ASR systems, particularly under low-SNR conditions. Our study confirms the effectiveness and stability of NMF-based feature fusion and denoising in improving recognition, offering insights into deploying Arabic ASR in real-world scenarios such as mobile and VoIP communications.
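The abstract gives no implementation details, so the following is only a rough illustration of the two preprocessing steps it names: low-rank NMF reconstruction of a noisy magnitude spectrogram as denoising, and fusion of MFCC, mel-spectrogram, and Gabor-filtered features into one feature matrix. It assumes librosa, scikit-learn, and SciPy; every hyperparameter (NMF rank, kernel shape, feature counts) is an illustrative placeholder, not the authors' setting.

    # Minimal sketch of NMF denoising and feature fusion, assuming
    # librosa/scikit-learn; hyperparameters are illustrative only.
    import numpy as np
    import librosa
    from scipy.signal import convolve2d
    from sklearn.decomposition import NMF

    def nmf_denoise(y, n_components=32):
        """Approximate the noisy magnitude spectrogram with a low-rank
        NMF reconstruction, then resynthesize with the original phase."""
        S = librosa.stft(y)
        mag, phase = np.abs(S), np.angle(S)
        model = NMF(n_components=n_components, init="nndsvda", max_iter=300)
        W = model.fit_transform(mag)     # spectral basis vectors
        H = model.components_            # per-frame activations
        mag_hat = W @ H                  # low-rank (denoised) magnitude
        return librosa.istft(mag_hat * np.exp(1j * phase))

    def fused_features(y, sr, n_mfcc=13, n_mels=40):
        """Stack MFCCs, a log-mel spectrogram, and a Gabor-filtered mel
        representation into one (features x frames) matrix."""
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        mel = librosa.power_to_db(
            librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
        # Stand-in for a spectro-temporal Gabor filter bank: convolve the
        # log-mel spectrogram with a single separable 2-D Gabor kernel.
        t = np.arange(-4, 5)
        env = np.exp(-t**2 / 8.0)
        kernel = np.outer(env, env * np.cos(0.5 * t))
        gabor = convolve2d(mel, kernel, mode="same")
        return np.vstack([mfcc, mel, gabor])

    y, sr = librosa.load(librosa.example("trumpet"))  # placeholder audio
    feats = fused_features(nmf_denoise(y), sr)
    print(feats.shape)                                # (93, n_frames)

Similarly, the paper does not specify the CNN-LSTM topology, so the sketch below is only one plausible arrangement in tf.keras: a convolutional front end over the fused feature map, followed by an LSTM over time; layer sizes and the 10-class output are assumptions.

    # Hypothetical CNN-LSTM classifier over fused feature frames.
    import tensorflow as tf

    def build_cnn_lstm(n_features, n_frames, n_classes=10):
        return tf.keras.Sequential([
            tf.keras.layers.Input(shape=(n_frames, n_features, 1)),
            tf.keras.layers.Conv2D(32, (3, 3), padding="same",
                                   activation="relu"),
            tf.keras.layers.MaxPooling2D((1, 2)),
            # Flatten the feature axis so each time step feeds the LSTM.
            tf.keras.layers.Reshape((n_frames, -1)),
            tf.keras.layers.LSTM(128),
            tf.keras.layers.Dense(n_classes, activation="softmax"),
        ])

    model = build_cnn_lstm(n_features=93, n_frames=200)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()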
