Learning effective joint representations is a central task in multimodal sentiment analysis. Previous methods focus on exploiting the correlations between modalities and improving performance through sophisticated fusion techniques. However, the inherent heterogeneity of the modalities can produce a distributional gap that impedes the full exploitation of inter-modal information and leaves the extracted features redundant and impure. To address this problem, we introduce the Multimodal Information Disentanglement (MInD) approach. MInD uses a shared encoder and multiple private encoders to decompose the input of each modality into a modality-invariant component, a modality-specific component, and a remnant noise component. The shared encoder captures the information common to all modalities, while the private encoders capture the distinctive characteristics of each one. Together, these representations give a comprehensive view of the multimodal data and facilitate the fusion step that precedes prediction. Furthermore, MInD improves the learned representations by explicitly modeling the task-irrelevant noise in an adversarial manner. Experiments on benchmark datasets, including CMU-MOSI, CMU-MOSEI, and UR-Funny, show that MInD outperforms existing state-of-the-art methods in both multimodal emotion recognition and multimodal humor detection.
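The decomposition described above can be illustrated with a minimal sketch. It is not the MInD implementation: the single tanh layers, the random untrained weights, the separate per-modality noise encoder, the assumption that all modalities are already projected to a common feature width, and the names `make_linear`, `encode`, `disentangle`, and `fuse` are all hypothetical choices made for illustration. The adversarial training of the noise component is omitted entirely.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Random (out_dim x in_dim) weights standing in for a trained encoder (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]


def encode(weights, x):
    """Apply one tanh-activated linear layer to the feature vector x."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]


def disentangle(inputs, hidden_dim=4, seed=0):
    """Split each modality's features into invariant, specific, and noise parts.

    `inputs` maps a modality name to a feature vector; all vectors are assumed
    to share one width so that a single shared encoder can process them.
    """
    rng = random.Random(seed)
    feat_dim = len(next(iter(inputs.values())))
    shared = make_linear(feat_dim, hidden_dim, rng)  # one encoder for all modalities
    parts = {}
    for name, x in inputs.items():
        private = make_linear(feat_dim, hidden_dim, rng)    # one per modality
        noise_enc = make_linear(feat_dim, hidden_dim, rng)  # hypothetical noise encoder
        parts[name] = {
            "invariant": encode(shared, x),
            "specific": encode(private, x),
            "noise": encode(noise_enc, x),
        }
    return parts


def fuse(parts):
    """Concatenate invariant and specific parts; the noise part is left out (assumption)."""
    fused = []
    for name in sorted(parts):
        fused.extend(parts[name]["invariant"])
        fused.extend(parts[name]["specific"])
    return fused


if __name__ == "__main__":
    toy = {"text": [0.1] * 8, "audio": [0.2] * 8, "video": [0.3] * 8}
    print(len(fuse(disentangle(toy))))
```

Concatenation is used here only because it is the simplest fusion to show; the abstract does not specify the fusion operator, so any downstream predictor would consume `fuse(...)` purely as a placeholder.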