Multimodal emotion recognition (MER) aims to infer human affect by jointly modeling audio and visual cues; however, existing approaches often struggle with temporal misalignment, weakly discriminative feature representations, and suboptimal fusion of heterogeneous modalities. To address these challenges, we propose AVT-CA, an Audio-Video Transformer architecture with cross-attention for robust emotion recognition. The proposed model introduces a hierarchical video feature representation that combines channel attention, spatial attention, and local feature extraction to emphasize emotionally salient regions while suppressing irrelevant information. These refined visual features are integrated with audio representations through an intermediate transformer-based fusion mechanism that captures interlinked temporal dependencies across modalities. Furthermore, a cross-attention module selectively reinforces mutually consistent audio-visual cues, enabling effective feature selection and noise-aware fusion. Extensive experiments on three benchmark datasets, CMU-MOSEI, RAVDESS, and CREMA-D, demonstrate that AVT-CA consistently outperforms state-of-the-art baselines, achieving significant improvements in both accuracy and F1-score. Our source code is publicly available at https://github.com/shravan-18/AVTCA.
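To make the cross-attention fusion concrete, the following is a minimal numpy sketch of scaled dot-product cross-attention, where queries derived from one modality (audio) attend over keys and values from the other (video). This is an illustrative sketch, not the authors' exact implementation: the function name, the choice of audio as the query side, and the feature shapes are assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(audio_feats, video_feats):
    """Illustrative cross-attention: audio queries attend over video keys/values.

    audio_feats: (T_a, d) array of audio-frame features (hypothetical shape)
    video_feats: (T_v, d) array of video-frame features (hypothetical shape)
    Returns the fused features (T_a, d) and the attention weights (T_a, T_v).
    """
    d = audio_feats.shape[-1]
    # Scaled dot-product scores between every audio and video frame
    scores = audio_feats @ video_feats.T / np.sqrt(d)
    # Each audio frame's weights over video frames sum to 1
    weights = softmax(scores, axis=-1)
    # Fused output: video features reweighted by audio-driven attention
    return weights @ video_feats, weights
```

In the full model, learned query/key/value projections would precede this step, and a symmetric video-to-audio branch could reinforce mutually consistent cues in both directions; the sketch keeps only the core attention arithmetic.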