This paper introduces MAVEN (Multi-modal Attention for Valence-Arousal Emotion Network), a novel architecture for dynamic emotion recognition through dimensional modeling of affect. The model integrates visual, audio, and textual modalities via a bi-directional cross-modal attention mechanism with six distinct attention pathways, enabling comprehensive interactions between all modality pairs. Our approach employs modality-specific encoders to extract rich feature representations from synchronized video frames, audio segments, and transcripts. The architecture's novelty lies in its cross-modal enhancement strategy, where each modality representation is refined through weighted attention from the other modalities, followed by self-attention refinement through modality-specific encoders. Rather than directly predicting valence-arousal values, MAVEN predicts emotions in polar coordinates, aligning with psychological models of the emotion circumplex. Experimental evaluation on the Aff-Wild2 dataset demonstrates the effectiveness of our approach, with performance measured using the Concordance Correlation Coefficient (CCC). The multi-stage architecture demonstrates superior ability to capture the complex, nuanced nature of emotional expressions in conversational videos, advancing the state of the art in continuous emotion recognition in-the-wild. Code is available at: https://github.com/Vrushank-Ahire/MAVEN_8th_ABAW.
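As a minimal sketch of two quantities the abstract references, the snippet below implements the standard Concordance Correlation Coefficient used for evaluation, and the conversion of a (valence, arousal) pair into the polar form (intensity, angle) on the emotion circumplex. The function names and the use of NumPy are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def ccc(y_true, y_pred):
    # Concordance Correlation Coefficient (Lin, 1989):
    # 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)

def to_polar(valence, arousal):
    # Map Cartesian valence-arousal to polar coordinates:
    # radius = emotion intensity, angle = position on the circumplex.
    radius = np.hypot(valence, arousal)
    angle = np.arctan2(arousal, valence)
    return radius, angle
```

Unlike Pearson correlation, CCC penalizes both scale and location shifts between predictions and labels, so a model that tracks the trend but is biased scores lower; this is why it is the standard metric for continuous affect benchmarks such as Aff-Wild2.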