Multimodal learning has shown a significant advantage in affective analysis tasks because it draws on the comprehensive information carried by multiple modalities, particularly their complementary information. Consequently, many recent studies focus on disentangling modality-invariant and modality-specific representations from the input data and then fusing them for prediction. However, our study shows that modality-specific representations may contain information that is irrelevant to, or in conflict with, the task, which degrades the effectiveness of the learned multimodal representations. We revisit the disentanglement problem and propose a novel triple disentanglement approach, TriDiRA, which disentangles modality-invariant, effective modality-specific, and ineffective modality-specific representations from the input data. By fusing only the modality-invariant and effective modality-specific representations, TriDiRA can significantly alleviate the impact of irrelevant and conflicting information across modalities during model training. Extensive experiments on four benchmark datasets demonstrate the effectiveness and generalizability of our triple disentanglement, which outperforms state-of-the-art (SOTA) methods.
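To make the fusion rule concrete, the following is a minimal structural sketch, not the TriDiRA architecture itself. It assumes three hypothetical modalities (text, audio, vision), hypothetical names such as `TripleSplitEncoder` and `fuse`, and untrained random linear projections standing in for learned encoders. The training objectives that would actually make the three parts invariant, effective, or ineffective are not described in the abstract and are omitted here; the sketch only shows that each modality is split into three parts and that the ineffective part is excluded from fusion.

```python
import random


def linear(in_dim, out_dim, rng):
    """Random (out_dim x in_dim) weight matrix standing in for a learned layer."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights, x):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class TripleSplitEncoder:
    """Hypothetical per-modality encoder with three output heads."""

    PART_NAMES = ("invariant", "effective", "ineffective")

    def __init__(self, in_dim, rep_dim, rng):
        self.heads = {name: linear(in_dim, rep_dim, rng) for name in self.PART_NAMES}

    def __call__(self, x):
        # Each head yields one of the three representations for this modality.
        return {name: project(w, x) for name, w in self.heads.items()}


def fuse(parts_by_modality):
    """Concatenate only the invariant and effective-specific parts.

    The ineffective-specific part is deliberately left out of the fused vector.
    """
    fused = []
    for modality in sorted(parts_by_modality):
        parts = parts_by_modality[modality]
        fused.extend(parts["invariant"])
        fused.extend(parts["effective"])
    return fused


rng = random.Random(0)
input_dims = {"text": 8, "audio": 5, "vision": 6}  # assumed feature sizes
rep_dim = 4  # assumed size of each representation

encoders = {m: TripleSplitEncoder(d, rep_dim, rng) for m, d in input_dims.items()}
inputs = {m: [rng.gauss(0.0, 1.0) for _ in range(d)] for m, d in input_dims.items()}

parts = {m: encoders[m](inputs[m]) for m in inputs}
fused = fuse(parts)
print(len(fused))  # 3 modalities x 2 kept parts x rep_dim
```

In a trained model, the fused vector would feed a prediction head, while the ineffective-specific part would be used only during training to absorb task-irrelevant or conflicting information.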