Although Multimodal Sentiment Analysis (MSA) effectively leverages rich information from the language, visual, and acoustic modalities, existing methods still face two core challenges: 1) static conflict-suppression mechanisms fail to adapt to dynamic variations across samples, and 2) the inherent sentiment bias of the language modality, which can misguide learning from the other modalities, remains entangled with the language features. To this end, we propose a Dynamic Multimodal Causal Disentanglement and Adaptive Fusion Framework (MCAF), built around a Multi-Granularity Causal Dynamic Router and a Conditional Diffusion Denoising Module. First, we introduce a causal intervention module based on the information bottleneck principle, which builds a Structural Causal Model to disentangle sentiment bias from language features, yielding a "de-confounded" language representation that serves as a pure guiding signal. Second, we devise a Dynamic Multimodal Router that evaluates, in real time, the interaction states (complementary, conflicting, or redundant) among the visual, acoustic, and de-confounded language signals at three levels (feature, temporal, and modality), and then adaptively allocates weights and routes information flow for fine-grained regulation. Finally, a lightweight Conditional Diffusion Denoising Module iteratively denoises the fused joint representation to explicitly filter out residual irrelevant information, producing a robust hyper-modality representation. Extensive experiments on the CMU-MOSI and CMU-MOSEI benchmarks show that MCAF sets a new state of the art on key classification metrics, achieving an Acc-2/F1 of 86.52%/86.51% on MOSI and 86.72%/86.65% on MOSEI, while remaining highly competitive on the remaining metrics. Comprehensive analyses and visualizations further validate its efficacy in dynamically perceiving interactions, disentangling bias, and enhancing interpretability.
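To make the idea of per-sample adaptive weighting concrete, the following is a minimal toy sketch of modality-level routing only. It is not the MCAF router: scoring each modality by cosine similarity to the language guide, the softmax gate, the `temperature` parameter, and the names `cosine`, `softmax`, and `route` are all illustrative assumptions, and the feature-level and temporal-level routing described in the abstract are not modeled.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(language, visual, acoustic, temperature=1.0):
    """Hypothetical modality-level gate: returns (weights, fused vector).

    Assumption: a modality that agrees with the language guide gets a
    larger weight, and a conflicting one is down-weighted per sample.
    """
    feats = {"language": language, "visual": visual, "acoustic": acoustic}
    # Agreement of each modality with the language guide, scaled by temperature.
    scores = [cosine(language, f) / temperature for f in feats.values()]
    weights = softmax(scores)
    # Weighted sum of the three modality vectors, dimension by dimension.
    fused = [
        sum(w * f[i] for w, f in zip(weights, feats.values()))
        for i in range(len(language))
    ]
    return dict(zip(feats, weights)), fused

# Visual agrees with language; acoustic conflicts with it.
weights, fused = route([1.0, 0.0], [1.0, 0.0], [-1.0, 0.0])
```

Because the weights are recomputed from each sample's features, a conflicting modality is suppressed only for the samples in which it actually conflicts, which is the contrast with a static suppression rule.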