Multimodal intention understanding (MIU) is an indispensable component of human expression analysis (e.g., sentiment or humor) from heterogeneous modalities, including visual postures, linguistic contents, and acoustic behaviors. Existing works invariably focus on designing sophisticated structures or fusion strategies to achieve impressive improvements. Unfortunately, they all suffer from the subject variation problem caused by data distribution discrepancies among subjects. Concretely, distinct subjects in the training data, each with different expression habits and characteristics, easily mislead MIU models into learning subject-specific spurious correlations, which significantly limits performance and generalizability on unseen subjects. Motivated by this observation, we introduce a recapitulative causal graph to formulate the MIU procedure and analyze the confounding effect of subjects. We then propose SuCI, a simple yet effective causal intervention module that disentangles the impact of subjects acting as unobserved confounders and trains the model on true causal effects. As a plug-and-play component, SuCI can be widely applied to most methods that seek unbiased predictions. Comprehensive experiments on several MIU benchmarks clearly demonstrate the effectiveness of the proposed module.
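The abstract does not detail SuCI's architecture, but the causal intervention it invokes is commonly realized via backdoor adjustment, P(y | do(x)) = Σ_s P(y | x, s) P(s), which averages predictions over confounder strata (here, subjects) instead of letting one subject's statistics dominate. The sketch below is a generic, hypothetical illustration of that adjustment with a toy classifier; the feature dimensions, subject-prototype dictionary, and linear head are all illustrative assumptions, not the paper's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: d-dim fused multimodal features and a small
# "confounder dictionary" of subject prototypes (one vector per subject).
d, n_subjects, n_classes = 8, 5, 3

x = rng.normal(size=d)                             # fused multimodal feature
subject_protos = rng.normal(size=(n_subjects, d))  # confounder strata (subjects)
prior = np.full(n_subjects, 1.0 / n_subjects)      # P(s), assumed uniform
W = rng.normal(size=(2 * d, n_classes))            # toy classifier head

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Backdoor adjustment: P(y | do(x)) = sum_s P(y | x, s) P(s).
# Each stratum conditions the prediction on one subject prototype,
# then the results are averaged under the subject prior.
p_do = np.zeros(n_classes)
for s in range(n_subjects):
    joint = np.concatenate([x, subject_protos[s]])  # condition on stratum s
    p_do += prior[s] * softmax(joint @ W)
```

Because `p_do` is a convex combination of per-stratum softmax outputs, it remains a valid probability distribution while no single subject's correlation pattern dominates the prediction.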