Modality representation learning is an important problem in multimodal sentiment analysis (MSA), since highly distinguishable representations can improve analysis performance. Previous MSA work has usually focused on multimodal fusion strategies, and modality representation learning itself has received less attention. Recently, contrastive learning has been shown to be effective at endowing learned representations with stronger discriminative ability. Inspired by this, we explore approaches to improving modality representations with contrastive learning. To this end, we devise a three-stage framework with multi-view contrastive learning that refines representations for specific objectives. In the first stage, to improve unimodal representations, we employ supervised contrastive learning to pull samples of the same class together while pushing other samples apart. In the second stage, a self-supervised contrastive learning objective is designed to improve the distilled unimodal representations after cross-modal interaction. In the third stage, we again apply supervised contrastive learning to enhance the fused multimodal representation. After all contrastive training is complete, we perform the classification task on the frozen representations. We conduct experiments on three open datasets, and the results show the advantages of our model.
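To make the first-stage objective concrete, the following is a minimal sketch of a supervised contrastive loss that pulls same-class samples together and pushes other samples apart. The abstract does not give the exact formulation, so this assumes the widely used form in which each anchor averages the log-probability of its same-class samples under a temperature-scaled softmax over all other samples in the batch. The function name `supervised_contrastive_loss`, the default `temperature=0.1`, and the use of NumPy are illustrative assumptions, not details from the paper.

```python
import numpy as np


def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss over one batch (illustrative sketch).

    features: (N, D) array of representations, one row per sample.
    labels:   (N,) array of class labels.
    temperature: softmax temperature (assumed value, not from the paper).
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    n = features.shape[0]

    # Positives for an anchor are the other samples sharing its label.
    self_mask = np.eye(n, dtype=bool)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        # No anchor has a same-class partner, so there is nothing to pull together.
        return 0.0

    # L2-normalise so that dot products are cosine similarities.
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    logits = z @ z.T / temperature

    # Exclude each anchor's similarity with itself from the softmax.
    logits = np.where(self_mask, -np.inf, logits)

    # Numerically stable log-softmax over the remaining samples in each row.
    row_max = logits.max(axis=1, keepdims=True)
    shifted = logits - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Average log-probability of the positives, for anchors that have any.
    sum_pos = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    loss_per_anchor = -sum_pos[valid] / pos_count[valid]
    return float(loss_per_anchor.mean())


if __name__ == "__main__":
    # Toy batch: two tight clusters in 2-D.
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print("labels match clusters:", supervised_contrastive_loss(feats, [0, 0, 1, 1]))
    print("labels cut across clusters:", supervised_contrastive_loss(feats, [0, 1, 0, 1]))
```

On the toy batch, the loss is lower when the labels agree with the geometric clusters than when they cut across them, which is the behaviour the first and third stages rely on. A self-supervised variant for the second stage would replace the label-based positive mask with a pairing rule that needs no labels; the abstract does not say which pairs serve as positives, so that choice is left open here.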