The prevalence of sarcasm in multimodal dialogues on social platforms makes understanding the true intent behind online content a crucial yet challenging task. Comprehensive sarcasm analysis involves two key tasks: Multimodal Sarcasm Detection (MSD) and Multimodal Sarcasm Explanation (MuSE). Intuitively, detection is the outcome of the reasoning process that explains the sarcasm. Current research predominantly addresses either MSD or MuSE in isolation. Although some recent work has attempted to integrate the two tasks, the inherent causal dependency between them is often overlooked. To bridge this gap, we propose MuVaC, a variational causal inference framework that mimics human cognitive mechanisms for understanding sarcasm and enables robust multimodal feature learning to jointly optimize MSD and MuSE. Specifically, we first model MSD and MuSE from the perspective of structural causal models, establishing variational causal pathways that define the objectives for joint optimization. Next, we design an alignment-then-fusion approach to integrate multimodal features, providing robust fused representations for sarcasm detection and explanation generation. Finally, we enhance the trustworthiness of the reasoning by ensuring consistency between detection results and explanations. Experimental results demonstrate the superiority of MuVaC on public datasets, offering a new perspective for understanding multimodal sarcasm.