Multimodal sarcasm detection has attracted growing interest with the rise of multimedia posts on social media. Understanding sarcastic image-text posts often requires external contextual knowledge, such as cultural references or commonsense reasoning. However, existing models rely mainly on shallow cues, such as image captions or object-attribute pairs, and struggle to capture the deeper rationale behind sarcasm. To address this, we propose \textbf{MiDRE} (\textbf{Mi}xture of \textbf{D}ual \textbf{R}easoning \textbf{E}xperts), which integrates an internal reasoning expert that detects incongruities within the image-text pair and an external reasoning expert that leverages structured rationales generated by Chain-of-Thought prompting of a Large Vision-Language Model. An adaptive gating mechanism dynamically weighs the two experts, selecting the most relevant reasoning path. Unlike prior methods that treat external knowledge as static input, MiDRE adaptively decides when such knowledge is beneficial, mitigating the risk of hallucinated or irrelevant signals from large models. Experiments on two benchmark datasets show that MiDRE outperforms strong baselines. Qualitative analyses highlight the crucial role of external rationales: even when occasionally noisy, they provide valuable cues that guide the model toward a better understanding of sarcasm.
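The adaptive gating described above can be sketched in a minimal form: a learned gate scores the two expert representations and produces a convex combination of them. This is an illustrative sketch only, assuming a simple linear gate over concatenated expert features; the function names, shapes, and weights here are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_gate(h_internal, h_external, W_g):
    """Fuse two expert representations with a learned soft gate.

    h_internal, h_external: feature vectors from the internal and
    external reasoning experts (hypothetical shapes).
    W_g: gate weights mapping the concatenated features to 2 scores.
    Returns the gated fusion and the gate weights (which sum to 1).
    """
    h = np.concatenate([h_internal, h_external])
    g = softmax(W_g @ h)  # g[0] weighs the internal path, g[1] the external
    return g[0] * h_internal + g[1] * h_external, g

# Toy usage with random features and hypothetical dimensions
rng = np.random.default_rng(0)
d = 4
h_int = rng.normal(size=d)
h_ext = rng.normal(size=d)
W_g = rng.normal(size=(2, 2 * d))
fused, gate = adaptive_gate(h_int, h_ext, W_g)
```

In practice the gate would be trained end-to-end with the experts, so that it learns to down-weight the external rationale when it is noisy or hallucinated.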