Multi-modal multi-label emotion recognition (MMER) aims to identify relevant emotions from multiple modalities. The challenge of MMER is how to effectively capture discriminative features for multiple labels from heterogeneous data. Recent studies mainly explore fusion strategies that integrate multi-modal information into a single unified representation shared by all labels. However, such a learning scheme not only overlooks the specificity of each modality but also fails to capture individual discriminative features for different labels. Moreover, dependencies of labels and modalities cannot be effectively modeled. To address these issues, this paper presents ContrAstive feature Reconstruction and AggregaTion (CARAT) for the MMER task. Specifically, we devise a reconstruction-based fusion mechanism to better model fine-grained modality-to-label dependencies by contrastively learning modal-separated and label-specific features. To further exploit modality complementarity, we introduce a shuffle-based aggregation strategy to enrich co-occurrence collaboration among labels. Experiments on two benchmark datasets, CMU-MOSEI and M3ED, demonstrate the effectiveness of CARAT over state-of-the-art methods. Code is available at https://github.com/chengzju/CARAT.
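The abstract names a shuffle-based aggregation over label-specific, modality-separated features but does not specify its mechanics. As a rough illustration only, the sketch below assumes such features are already available as an array of shape (M, B, L, D) for M modalities, B samples, L labels, and feature size D. The function name `shuffle_aggregate`, the independent permutation per (sample, label) slot, and the concatenation step are all hypothetical choices made for this example, not the authors' implementation, which is in the linked repository.

```python
import numpy as np


def shuffle_aggregate(feats: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Hypothetical modality-wise shuffle followed by concatenation.

    feats: array of shape (M, B, L, D) holding one D-dim feature per
           modality, sample, and label (assumed layout, not from the paper).
    Returns an array of shape (B, L, M * D) in which, for every
    (sample, label) slot, the M modality features appear in a random order.
    """
    M, B, L, D = feats.shape
    # Draw an independent permutation of the M modalities for each (sample, label) slot.
    perm = np.argsort(rng.random((M, B, L)), axis=0)
    # Reorder along the modality axis; the index array broadcasts over D.
    shuffled = np.take_along_axis(feats, perm[..., None], axis=0)
    # Move the modality axis next to the feature axis, then concatenate.
    return np.moveaxis(shuffled, 0, 2).reshape(B, L, M * D)


# Toy usage: 3 modalities, 4 samples, 6 emotion labels, 2-dim features.
rng = np.random.default_rng(0)
toy = rng.normal(size=(3, 4, 6, 2))
aggregated = shuffle_aggregate(toy, rng)
print(aggregated.shape)
```

Shuffling the modality order before concatenation means a downstream classifier cannot rely on any fixed modality position, which is one plausible way to encourage it to use every modality's label-specific feature; whether CARAT uses this exact scheme is not stated in the abstract.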