Social media abounds with multimodal sarcasm, and identifying sarcasm targets is particularly challenging because the underlying incongruity is implicit and not directly evident in either the text or the image modality. Current methods for Multimodal Sarcasm Target Identification (MSTI) predominantly focus on superficial indicators in an end-to-end manner, overlooking the nuanced understanding of multimodal sarcasm conveyed jointly through the text and image. This paper proposes a versatile MSTI framework with a coarse-to-fine paradigm that augments sarcasm explainability with reasoning and pre-training knowledge. Inspired by the powerful capacity of Large Multimodal Models (LMMs) for multimodal reasoning, we first engage LMMs to generate competing rationales for coarser-grained pre-training of a small language model on multimodal sarcasm detection. We then fine-tune the model for finer-grained sarcasm target identification. Our framework is thus able to unveil the intricate targets within multimodal sarcasm and mitigate the negative impact of potential noise inherent in LMMs. Experimental results demonstrate that our model far outperforms state-of-the-art MSTI methods and exhibits marked explainability in deciphering sarcasm.