Multimodal Sentiment Analysis (MSA) aims to understand human intentions by integrating emotion-related clues from diverse modalities, such as vision, language, and audio. Unfortunately, the current MSA task invariably suffers from unintended dataset biases, particularly multimodal utterance-level label bias and word-level context bias. These harmful biases can mislead models into relying on statistical shortcuts and spurious correlations, causing severe performance bottlenecks. To alleviate these issues, we present a Multimodal Counterfactual Inference Sentiment (MCIS) analysis framework based on causality rather than conventional likelihood. Concretely, we first formulate a causal graph to discover harmful biases in already-trained vanilla models. In the inference phase, given a factual multimodal input, MCIS imagines two counterfactual scenarios to purify and mitigate these biases. MCIS can then make unbiased decisions from biased observations by comparing factual and counterfactual outcomes. We conduct extensive experiments on several standard MSA benchmarks. Qualitative and quantitative results demonstrate the effectiveness of the proposed framework.
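The comparison of factual and counterfactual outcomes at inference time can be illustrated with a minimal sketch. The abstract does not state how the outcomes are combined, so everything here is an assumption for illustration: the subtraction rule, the weights `alpha` and `beta`, the function names, and the toy scores are hypothetical and are not taken from the MCIS paper. The sketch assumes that an already-trained model has produced three score vectors over the same sentiment classes: one for the factual input, one for a counterfactual input that exposes only the utterance-level label bias, and one for a counterfactual input that exposes only the word-level context bias.

```python
from typing import List, Sequence


def counterfactual_debias(
    factual: Sequence[float],
    label_bias_cf: Sequence[float],
    context_bias_cf: Sequence[float],
    alpha: float = 1.0,
    beta: float = 1.0,
) -> List[float]:
    """Subtract two counterfactual score vectors from the factual one.

    Assumption: a simple weighted subtraction stands in for the comparison
    of factual and counterfactual outcomes. alpha and beta are hypothetical
    trade-off weights, not parameters named in the source.
    """
    if not (len(factual) == len(label_bias_cf) == len(context_bias_cf)):
        raise ValueError("all score vectors must cover the same classes")
    return [
        f - alpha * lb - beta * cb
        for f, lb, cb in zip(factual, label_bias_cf, context_bias_cf)
    ]


def argmax(scores: Sequence[float]) -> int:
    """Return the index of the highest score (first index on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Toy scores over the classes [negative, neutral, positive] (made-up values).
factual = [2.0, 1.5, 0.5]          # biased prediction on the real input
label_bias_cf = [1.0, 0.0, 0.0]    # scores attributable to label bias alone
context_bias_cf = [0.5, 0.0, 0.0]  # scores attributable to context bias alone

debiased = counterfactual_debias(factual, label_bias_cf, context_bias_cf)
print(argmax(factual), argmax(debiased))
```

In this toy case the biased prediction favours the first class, while the debiased scores favour the second, because the bias-only scores account for most of the first class's margin. Since the adjustment is applied to the outputs of an already-trained model, a scheme of this kind requires no retraining.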