A Multimodal ASR System with Contextual Awareness and Emotional Sensitivity
Keywords:
Multimodal Automatic Speech Recognition (ASR), Arabic Speech Recognition, Emotion Detection, Audio-Visual Speech Processing, Wav2Vec 2.0, Lip Reading, AVANEmo Dataset
Abstract
Demand for accurate speech recognition is growing across diverse languages, yet Arabic remains particularly challenging due to dialectal variation, background noise, and emotional context. Traditional Automatic Speech Recognition (ASR) models often struggle to maintain high accuracy under these conditions, leading to suboptimal performance in real-world applications. This study presents a novel multimodal ASR system that addresses these challenges by integrating audio, visual, and emotional cues to enhance both transcription accuracy and emotion detection for Arabic speech.
The proposed model was evaluated on the Audio-Visual Arabic Natural Emotion (AVANEmo) dataset, employing state-of-the-art techniques, including Wav2Vec 2.0 for audio feature extraction, convolutional neural networks for lip movement recognition, and a contextual language model to refine outputs. The system achieved a Word Error Rate (WER) of 16.3% and a Character Error Rate (CER) of 10.7%, outperforming existing models such as DeepSpeech (19.4% WER, 13.7% CER) and Jasper (18.2% WER, 12.9% CER). Moreover, the proposed model achieved 88.9% accuracy in emotion detection, surpassing the 84.2% reported by previous models. These results underscore the efficacy of the multimodal approach to Arabic speech recognition and emotion classification, highlighting its potential for real-world applications.
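To make the audio-visual fusion described above concrete, the following is a minimal sketch of how Wav2Vec 2.0 audio features and CNN-derived lip-movement features could be combined into joint transcription and emotion heads. The layer sizes, pooling strategy, checkpoint name (facebook/wav2vec2-base), and interpolation-based temporal alignment are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class MultimodalASRFusion(nn.Module):
    """Illustrative fusion of Wav2Vec 2.0 audio features with CNN lip features.

    Hypothetical layer sizes and fusion scheme; the actual system may differ.
    """

    def __init__(self, vocab_size=64, num_emotions=6):
        super().__init__()
        # Pretrained Wav2Vec 2.0 encoder for audio feature extraction
        self.audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        # Simple 3D CNN over lip-region frames (batch, channels, time, H, W)
        self.lip_cnn = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the temporal axis, pool space
        )
        fused_dim = self.audio_encoder.config.hidden_size + 32
        # Two heads: frame-level transcription logits and utterance-level emotion
        self.ctc_head = nn.Linear(fused_dim, vocab_size)
        self.emotion_head = nn.Linear(fused_dim, num_emotions)

    def forward(self, waveform, lip_frames):
        # waveform: (batch, samples); lip_frames: (batch, 3, T, H, W)
        audio_feats = self.audio_encoder(waveform).last_hidden_state   # (B, Ta, D)
        lip_feats = self.lip_cnn(lip_frames).squeeze(-1).squeeze(-1)   # (B, 32, Tv)
        # Align visual frames to the audio time axis by nearest-neighbor interpolation
        lip_feats = nn.functional.interpolate(lip_feats, size=audio_feats.size(1))
        lip_feats = lip_feats.transpose(1, 2)                          # (B, Ta, 32)
        fused = torch.cat([audio_feats, lip_feats], dim=-1)
        return self.ctc_head(fused), self.emotion_head(fused.mean(dim=1))
```

In this sketch, the transcription head would be trained with a CTC loss and refined by the contextual language model, while the pooled fused representation feeds the emotion classifier; the reported WER/CER and emotion-accuracy figures refer to the authors' full system, not this simplified example.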