Multimodal Sentiment Analysis (MSA) aims to understand human intentions by integrating emotion-related clues from diverse modalities, such as visual, language, and audio. Unfortunately, the current MSA task invariably suffers from unintended dataset biases, particularly multimodal utterance-level label bias and word-level context bias. These harmful biases can mislead models into relying on statistical shortcuts and spurious correlations, causing severe performance bottlenecks. To alleviate these issues, we present a Multimodal Counterfactual Inference Sentiment (MCIS) analysis framework based on causality rather than conventional likelihood. Concretely, we first formulate a causal graph to discover harmful biases in already-trained vanilla models. In the inference phase, given a factual multimodal input, MCIS imagines two counterfactual scenarios to isolate and mitigate these biases. MCIS can then make unbiased decisions from biased observations by comparing factual and counterfactual outcomes. We conduct extensive experiments on several standard MSA benchmarks, and qualitative and quantitative results demonstrate the effectiveness of the proposed framework.
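As a rough illustration (not necessarily the paper's exact formulation), the comparison of factual and counterfactual outcomes at inference time can be instantiated as subtracting bias-only counterfactual predictions from the factual prediction:

$$\hat{y}_{\mathrm{debiased}} = y(u, w) \;-\; \lambda_{1}\, y(u^{*}, w) \;-\; \lambda_{2}\, y(u, w^{*})$$

Here $y(u, w)$ denotes the vanilla model's prediction on the factual multimodal utterance $u$ with context words $w$; $u^{*}$ and $w^{*}$ are hypothetical counterfactual inputs (e.g., masked or mean-filled features) corresponding to the two imagined scenarios for utterance-level label bias and word-level context bias; and $\lambda_{1}$, $\lambda_{2}$ are assumed trade-off coefficients. All symbols here are illustrative rather than the paper's own notation.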