Multimodal sarcasm detection (MSD) aims to identify sarcasm within image-text pairs by modeling semantic incongruities across modalities. Existing methods often exploit cross-modal embedding misalignment to detect inconsistency but struggle when visual and textual content are loosely related or semantically indirect. While recent approaches leverage large language models (LLMs) to generate sarcastic cues, the inherent diversity and subjectivity of these generations often introduce noise. To address these limitations, we propose the Generative Discrepancy Comparison Network (GDCNet). This framework captures cross-modal conflicts by utilizing descriptive, factually grounded image captions generated by Multimodal LLMs (MLLMs) as stable semantic anchors. Specifically, GDCNet computes semantic and sentiment discrepancies between the generated objective description and the original text, alongside measuring visual-textual fidelity. These discrepancy features are then fused with visual and textual representations via a gated module to adaptively balance modality contributions. Extensive experiments on MSD benchmarks demonstrate GDCNet's superior accuracy and robustness, establishing a new state-of-the-art on the MMSD2.0 benchmark.
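The abstract sketches a pipeline: compute semantic and sentiment discrepancies between an MLLM-generated objective caption and the original text, then fuse these discrepancy features with visual and textual representations through a gated module. The following is a minimal illustrative sketch of that idea, not the authors' implementation; all embeddings, sentiment scores, and the gate parameterization are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 8
# Toy embeddings standing in for encoder outputs (assumption: all share dim d).
text_emb = rng.standard_normal(d)      # original post text
caption_emb = rng.standard_normal(d)   # MLLM-generated objective image caption
visual_emb = rng.standard_normal(d)    # image encoder output

# Discrepancy features: semantic gap between the factual caption and the text,
# plus a sentiment gap (the polarity scores here are placeholders).
semantic_disc = 1.0 - cosine(caption_emb, text_emb)
sentiment_disc = abs(0.7 - (-0.4))     # |sent(text) - sent(caption)|
disc_feat = np.array([semantic_disc, sentiment_disc])

# Gated fusion: the gate conditions on both modalities and the discrepancy
# features, then adaptively balances visual vs. textual contributions.
W_g = rng.standard_normal((d, 2 * d + 2)) * 0.1  # hypothetical gate weights
gate = sigmoid(W_g @ np.concatenate([visual_emb, text_emb, disc_feat]))
fused = gate * visual_emb + (1.0 - gate) * text_emb

assert fused.shape == (d,)
assert np.all((gate > 0.0) & (gate < 1.0))
```

In a real model the gate would be learned end-to-end and the discrepancy scores would come from trained semantic and sentiment heads; the sketch only shows how the gate lets discrepancy evidence modulate how much each modality contributes to the fused representation.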