Social media abounds with multimodal sarcasm, and identifying sarcasm targets is particularly challenging because the underlying incongruity is implicit and not directly evident in either the text or the image. Current methods for Multimodal Sarcasm Target Identification (MSTI) predominantly focus on superficial indicators in an end-to-end manner and overlook the nuanced understanding of sarcasm conveyed jointly by text and image. This paper proposes a versatile MSTI framework with a coarse-to-fine paradigm that augments sarcasm explainability with reasoning and pre-training knowledge. Inspired by the strong multimodal reasoning capacity of Large Multimodal Models (LMMs), we first engage LMMs to generate competing rationales for coarser-grained pre-training of a small language model on multimodal sarcasm detection. We then fine-tune the model for finer-grained sarcasm target identification. The framework is thus able to uncover the intricate targets within multimodal sarcasm while mitigating the negative impact of noise inherent in LMMs. Experimental results demonstrate that our model substantially outperforms state-of-the-art MSTI methods and also offers clear explainability in deciphering sarcasm.
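The coarse-to-fine paradigm can be illustrated with a minimal data-preparation sketch. It is not the implementation from this paper. It assumes that "competing rationales" means one rationale arguing the post is sarcastic and one arguing it is not, and that textual targets are encoded as BIO tags. The names `Sample`, `stub_lmm`, `build_pretraining_input`, and `build_finetuning_tags` are hypothetical, and the LMM is replaced by a stub that returns placeholder strings.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    text: str
    image_caption: str  # assumption: a caption stands in for the image here


# Hypothetical LMM interface: (sample, stance) -> rationale string.
RationaleFn = Callable[[Sample, str], str]


def stub_lmm(sample: Sample, stance: str) -> str:
    """Placeholder for a real LMM call; returns a dummy rationale."""
    return f"[{stance}] reading of '{sample.text}' given '{sample.image_caption}'"


def build_pretraining_input(sample: Sample, lmm: RationaleFn) -> str:
    """Stage 1 (coarse): pair the post with two competing rationales.

    A small model trained on sarcasm detection over this input has to weigh
    both rationales rather than trust a single, possibly noisy, LMM output.
    """
    pro = lmm(sample, "sarcastic")
    con = lmm(sample, "non-sarcastic")
    return f"text: {sample.text} | pro: {pro} | con: {con}"


def build_finetuning_tags(tokens: List[str], span: Tuple[int, int]) -> List[str]:
    """Stage 2 (fine): BIO tags for the target span [start, end) over tokens."""
    start, end = span
    if not 0 <= start < end <= len(tokens):
        raise ValueError("span must satisfy 0 <= start < end <= len(tokens)")
    tags = ["O"] * len(tokens)
    tags[start] = "B-T"  # first token of the sarcasm target
    for i in range(start + 1, end):
        tags[i] = "I-T"  # remaining tokens of the target
    return tags
```

In this sketch the stage 1 string would feed a binary sarcasm classifier, and the same model would then be fine-tuned on the stage 2 tags. Visual targets, such as bounding boxes in the image, are left out for brevity.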