Audiovisual emotion recognition (ER) in videos has considerable potential to outperform unimodal approaches because it can exploit both the inter-modal and intra-modal dependencies of the visual and auditory modalities. This work proposes a novel audiovisual ER system built on a joint multimodal transformer architecture with key-based cross-attention. The framework exploits the complementary nature of audio and visual cues in videos (vocal patterns and facial expressions), with the aim of improving on systems that rely on a single modality. Separate backbones first capture the intra-modal temporal dependencies within each modality (audio and visual). A joint multimodal transformer then integrates the individual modality embeddings, enabling the model to capture both inter-modal (between audio and visual) and intra-modal (within each modality) relationships. Extensive evaluations on the challenging Affwild2 dataset demonstrate that the proposed model significantly outperforms baseline and state-of-the-art methods in ER tasks.
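To make the fusion idea concrete, the following is a minimal NumPy sketch of combining intra-modal self-attention with inter-modal cross-attention over two embedding sequences. It is an illustration under stated assumptions, not the proposed model: the function names `attend` and `fuse`, the residual sum, the concatenation step, the sequence length, and the embedding size are all hypothetical, the learned query/key/value projections of a real transformer are omitted, and random arrays stand in for the backbone outputs.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, keys, values):
    """Scaled dot-product attention.

    queries: (T_q, d); keys and values: (T_k, d).
    Returns the attended features (T_q, d) and the weights (T_q, T_k).
    """
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ keys.T / scale, axis=-1)
    return weights @ values, weights


def fuse(audio, visual):
    """Hypothetical fusion of two time-aligned embedding sequences of shape (T, d)."""
    # Intra-modal: each modality attends to itself.
    a_self, _ = attend(audio, audio, audio)
    v_self, _ = attend(visual, visual, visual)
    # Inter-modal: queries from one modality, keys and values from the other.
    a_cross, w_av = attend(audio, visual, visual)
    v_cross, w_va = attend(visual, audio, audio)
    # Residual sum per modality, then concatenation along the feature axis
    # (an assumed design choice for this sketch).
    a_out = audio + a_self + a_cross
    v_out = visual + v_self + v_cross
    joint = np.concatenate([a_out, v_out], axis=-1)
    return joint, w_av, w_va


# Random stand-ins for the audio and visual backbone embeddings.
rng = np.random.default_rng(0)
T, d = 5, 8
audio = rng.standard_normal((T, d))
visual = rng.standard_normal((T, d))

joint, w_av, w_va = fuse(audio, visual)
print(joint.shape)  # (5, 16)
```

Each row of `w_av` is a probability distribution over visual time steps for one audio time step, which is one way to read "inter-modal dependency" in this setting. A trained system would add learned projections, multiple heads, and a prediction head on top of `joint`.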