Multimodal sentiment analysis (MSA) identifies individuals' sentiment states in videos by integrating visual, audio, and text modalities. Despite progress in existing methods, the inherent heterogeneity of modalities limits the effective capture of interactive sentiment features across them. In this paper, we introduce a Multi-Modality Collaborative Learning (MMCL) framework that facilitates cross-modal interactions and captures enhanced and complementary features from modality-common and modality-specific representations, respectively. Specifically, we design a parameter-free decoupling module that separates each uni-modal representation into modality-common and modality-specific components through a semantic assessment of cross-modal elements. For the modality-specific representations, inspired by the action-reward mechanism in reinforcement learning, we design policy models that adaptively mine complementary sentiment features under the guidance of a joint reward. For the modality-common representations, intra-modal attention is employed to highlight crucial components that play an enhancing role across modalities. Experimental results, including superiority evaluations on four databases, effectiveness verification of each module, and an assessment of the complementary features, demonstrate that MMCL successfully learns collaborative features across modalities and significantly improves performance. The code is available at https://github.com/smwanghhh/MMCL.
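To make the decoupling idea concrete, the following is a minimal PyTorch sketch, written purely for illustration: the abstract does not specify how the semantic assessment of cross-modal elements is performed, so this sketch assumes cosine similarity between each element of one modality and its best-matching element in the other modalities, with a hypothetical threshold `tau`; the function name `decouple` and all parameter names are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def decouple(x, others, tau=0.5):
    """Split one modality's features into modality-common and modality-specific parts.

    x:      (batch, len_x, dim) features of the target modality
    others: list of (batch, len_o, dim) features of the remaining modalities
    tau:    similarity threshold (hypothetical hyper-parameter)
    """
    x_n = F.normalize(x, dim=-1)
    sims = []
    for o in others:
        o_n = F.normalize(o, dim=-1)
        # Cosine similarity of every element of x to every element of o,
        # then keep the best match per element of x: (batch, len_x).
        sim_mat = torch.bmm(x_n, o_n.transpose(1, 2))
        sims.append(sim_mat.max(dim=-1).values)
    # Average cross-modal similarity per element; the split itself adds no learnable
    # parameters, consistent with the "parameter-free" description in the abstract.
    sim = torch.stack(sims, dim=0).mean(dim=0)            # (batch, len_x)
    common_mask = (sim > tau).float().unsqueeze(-1)        # 1 where semantics are shared
    common = x * common_mask                               # modality-common component
    specific = x * (1.0 - common_mask)                     # modality-specific component
    return common, specific
```

In this reading, the modality-common components would subsequently be re-weighted by intra-modal attention, while the modality-specific components would feed the reward-guided policy models; those stages are outside the scope of this sketch.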