Multimodal intention understanding (MIU) is an indispensable component of human expression analysis (e.g., sentiment or humor) from heterogeneous modalities, including visual postures, linguistic contents, and acoustic behaviors. Existing works invariably focus on designing sophisticated structures or fusion strategies to improve performance. Unfortunately, they all suffer from the subject variation problem, which is caused by data distribution discrepancies among subjects. Concretely, because subjects in the training data differ in their expression customs and characteristics, MIU models are easily misled into learning subject-specific spurious correlations, which significantly limits performance and generalizability to unseen subjects. Motivated by this observation, we introduce a recapitulative causal graph to formulate the MIU procedure and analyze the confounding effect of subjects. Then, we propose SuCI, a simple yet effective causal intervention module that disentangles the impact of subjects acting as unobserved confounders and trains the model on true causal effects. As a plug-and-play component, SuCI can be applied to most methods that seek unbiased predictions. Comprehensive experiments on several MIU benchmarks demonstrate the effectiveness of the proposed module.