Understanding emotions is a fundamental aspect of human communication. Integrating audio and video signals offers a more comprehensive view of emotional states than traditional methods that rely on a single modality, such as speech or facial expressions. Despite this potential, multimodal emotion recognition faces significant challenges, particularly in the synchronization, feature extraction, and fusion of diverse data sources. To address these issues, this paper introduces a novel transformer-based model, Audio-Video Transformer Fusion with Cross Attention (AVT-CA). AVT-CA employs a transformer fusion approach to capture and synchronize interlinked features from the audio and video inputs, resolving the synchronization problem. In addition, its Cross Attention mechanism selectively extracts and emphasizes critical features while discarding irrelevant ones from both modalities, addressing the feature extraction and fusion challenges. Extensive experiments on the CMU-MOSEI, RAVDESS, and CREMA-D datasets demonstrate the efficacy of the proposed model. The results underscore the value of AVT-CA for building precise and reliable multimodal emotion recognition systems in practical applications.
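To make the cross-attention fusion step concrete, the sketch below shows bidirectional audio-video cross attention in PyTorch. This is a minimal illustration under stated assumptions, not the authors' AVT-CA implementation: the embedding dimension, number of heads, mean pooling, and the seven-class output head are all hypothetical choices for demonstration.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Minimal sketch of audio-video cross attention (not the exact AVT-CA design).

    Each modality attends to the other: audio features act as queries over
    video features and vice versa, so mutually relevant features are
    emphasized and irrelevant ones are down-weighted before fusion.
    """

    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 7):
        super().__init__()
        # Audio queries attend over video keys/values, and vice versa.
        self.audio_to_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # num_classes=7 is an assumption (e.g. a typical emotion label set).
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio: (batch, T_audio, dim); video: (batch, T_video, dim)
        a_attended, _ = self.audio_to_video(query=audio, key=video, value=video)
        v_attended, _ = self.video_to_audio(query=video, key=audio, value=audio)
        # Pool over time, then concatenate the two cross-attended streams.
        fused = torch.cat([a_attended.mean(dim=1), v_attended.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# Example usage with random tensors standing in for extracted embeddings.
model = CrossAttentionFusion()
audio_feats = torch.randn(8, 100, 256)   # 8 clips, 100 audio frames
video_feats = torch.randn(8, 30, 256)    # 8 clips, 30 video frames
logits = model(audio_feats, video_feats) # (8, 7) emotion logits
```

Attending in both directions lets each modality select only the features of the other that are relevant to it, which is the intuition behind using cross attention for fusion rather than simple concatenation.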