Multi-modal multi-label emotion recognition (MMER) aims to identify relevant emotions from multiple modalities. The central challenge of MMER is how to effectively capture discriminative features for multiple labels from heterogeneous data. Recent studies mainly explore fusion strategies that integrate multi-modal information into a single unified representation shared by all labels. However, such a learning scheme not only overlooks the specificity of each modality but also fails to capture individual discriminative features for different labels. Moreover, the dependencies between labels and modalities cannot be effectively modeled. To address these issues, this paper presents ContrAstive feature Reconstruction and AggregaTion (CARAT) for the MMER task. Specifically, we devise a reconstruction-based fusion mechanism to better model fine-grained modality-to-label dependencies by contrastively learning modal-separated and label-specific features. To further exploit modality complementarity, we introduce a shuffle-based aggregation strategy to enrich co-occurrence collaboration among labels. Experiments on two benchmark datasets, CMU-MOSEI and M3ED, demonstrate the effectiveness of CARAT over state-of-the-art methods. Code is available at https://github.com/chengzju/CARAT.