Multimodal emotion recognition has recently attracted considerable interest in affective computing because of its potential to outperform isolated unimodal approaches. Audio and visual modalities are the two predominant contact-free channels in videos and are often expected to carry a complementary relationship with each other. However, the audio and visual channels are not always complementary, which can result in poor audio-visual feature representations and degrade system performance. In this paper, we propose a flexible audio-visual fusion model that can adapt to weak complementary relationships using a gated attention mechanism. Specifically, we extend the recursive joint cross-attention model by introducing a gating mechanism in every iteration to control the flow of information between the input features and the attended features depending on the strength of their complementary relationship. For instance, if the modalities exhibit a strong complementary relationship, the gating mechanism selects the cross-attended features; otherwise, it selects the non-attended features. To further improve performance, we also introduce a stage gating mechanism that controls the flow of information across the gated outputs of each iteration. The proposed model therefore adds flexibility to the recursive joint cross-attention mechanism and improves system performance even when the audio and visual modalities do not have a strong complementary relationship with each other. The proposed model has been evaluated on the challenging Affwild2 dataset and significantly outperforms state-of-the-art fusion approaches.
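A minimal sketch of the per-iteration gating idea described above, assuming a learned sigmoid gate over the concatenated non-attended and cross-attended features; the module and parameter names (`GatedFusion`, `dim`) are illustrative and not necessarily the paper's exact formulation.

```python
# Hypothetical sketch, not the authors' exact implementation: a sigmoid gate
# blends non-attended features x with cross-attended features x_att.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """A gate near 1 favors the cross-attended features (strong complementary
    relationship); a gate near 0 falls back to the original features."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)  # per-channel gating scores

    def forward(self, x: torch.Tensor, x_att: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([x, x_att], dim=-1)))
        return g * x_att + (1.0 - g) * x

# Usage: one gated fusion step per modality in each recursive iteration.
audio, audio_att = torch.randn(8, 128), torch.randn(8, 128)
fused_audio = GatedFusion(128)(audio, audio_att)
```

Under this reading, the stage gating mechanism would apply an analogous gate across the gated outputs of successive iterations rather than within a single iteration.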