Despite progress in multimodal sarcasm detection, existing datasets and methods predominantly focus on single-image scenarios, overlooking potential semantic and affective relations across multiple images. This leaves a gap in modeling cases where sarcasm is triggered by multi-image cues in real-world settings. To bridge this gap, we introduce MMSD3.0, a new benchmark composed entirely of multi-image samples curated from tweets and Amazon reviews. We further propose the Cross-Image Reasoning Model (CIRM), which performs targeted cross-image sequence modeling to capture latent inter-image connections. In addition, we introduce a relevance-guided, fine-grained cross-modal fusion mechanism based on text-image correspondence to reduce information loss during integration. We establish a comprehensive suite of strong, representative baselines and conduct extensive experiments, showing that MMSD3.0 is an effective and reliable benchmark that better reflects real-world conditions. Moreover, CIRM achieves state-of-the-art performance across MMSD, MMSD2.0, and MMSD3.0, validating its effectiveness in both single-image and multi-image scenarios.