Multi-modal multi-label emotion recognition (MMER) aims to identify relevant emotions from multiple modalities. The central challenge of MMER is how to effectively capture discriminative features for multiple labels from heterogeneous data. Recent studies are mainly devoted to exploring fusion strategies that integrate multi-modal information into a single unified representation shared by all labels. However, such a learning scheme not only overlooks the specificity of each modality but also fails to capture individual discriminative features for different labels. Moreover, dependencies among labels and modalities cannot be effectively modeled. To address these issues, this paper presents ContrAstive feature Reconstruction and AggregaTion (CARAT) for the MMER task. Specifically, we devise a reconstruction-based fusion mechanism to better model fine-grained modality-to-label dependencies by contrastively learning modal-separated and label-specific features. To further exploit modality complementarity, we introduce a shuffle-based aggregation strategy to enrich co-occurrence collaboration among labels. Experiments on two benchmark datasets, CMU-MOSEI and M3ED, demonstrate the effectiveness of CARAT over state-of-the-art methods. Code is available at https://github.com/chengzju/CARAT.
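To make the idea of shuffling label-specific features more concrete, the following is a minimal illustrative sketch and not the CARAT implementation. It assumes that each sample in a batch already has one feature vector per label, and that "shuffling" means permuting, independently for each label, which sample that label's feature (and its ground-truth value) is drawn from. The function name `samplewise_shuffle`, the toy data, and this reading of the shuffle step are all assumptions made for illustration; the abstract does not specify the procedure.

```python
import random


def samplewise_shuffle(features, labels, rng):
    """Recombine label-specific features across the samples of a batch.

    Hypothetical helper (an assumption, not taken from the paper).
    features[b][l] is the feature vector of sample b for label l.
    labels[b][l] is the 0/1 ground truth of sample b for label l.
    For every label an independent permutation of the batch is drawn, so
    each new sample mixes label features from different source samples and
    therefore exhibits a new label co-occurrence pattern.
    """
    batch_size = len(features)
    num_labels = len(labels[0])
    new_features = [[None] * num_labels for _ in range(batch_size)]
    new_labels = [[0] * num_labels for _ in range(batch_size)]
    for l in range(num_labels):
        perm = list(range(batch_size))
        rng.shuffle(perm)  # independent permutation for this label
        for b, src in enumerate(perm):
            # The feature and its label always travel together.
            new_features[b][l] = features[src][l]
            new_labels[b][l] = labels[src][l]
    return new_features, new_labels


# Toy batch: 4 samples, 3 labels, 1-dimensional features.
# The value 10 * b + l encodes the source sample b and the label l.
rng = random.Random(0)
features = [[[float(10 * b + l)] for l in range(3)] for b in range(4)]
labels = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
mixed_features, mixed_labels = samplewise_shuffle(features, labels, rng)
```

Under these assumptions, the recombined batch keeps the per-label class balance unchanged while exposing a downstream classifier to label combinations that did not occur in the original batch.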