The prevalence of sarcasm in social media, conveyed through text-image combinations, presents significant challenges for sentiment analysis and intention mining. Current multi-modal sarcasm detection methods have been shown to struggle with biases from spurious cues, leading to a superficial understanding of the complex interactions between text and image. To address these issues, we propose InterCLIP-MEP, a robust framework for multi-modal sarcasm detection. InterCLIP-MEP introduces a refined variant of CLIP, Interactive CLIP (InterCLIP), as the backbone, enhancing sample representations by embedding cross-modal information in each encoder. Furthermore, a novel training strategy is designed to adapt InterCLIP for a Memory-Enhanced Predictor (MEP). MEP uses a dynamic dual-channel memory to store valuable historical knowledge of test samples and then leverages this memory as a non-parametric classifier to derive the final prediction. By using InterCLIP to encode text-image interactions more effectively and incorporating MEP, InterCLIP-MEP achieves more robust recognition of multi-modal sarcasm. Experiments demonstrate that InterCLIP-MEP achieves state-of-the-art performance on the MMSD2.0 benchmark. Code and data are available at https://github.com/CoderChen01/InterCLIP-MEP.
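To make the memory-as-classifier idea concrete, the following is a minimal sketch of a dual-channel memory used as a non-parametric classifier, with one channel per class (non-sarcastic vs. sarcastic). All details here (the class name `DualChannelMemory`, the confidence threshold, FIFO eviction, and cosine-similarity scoring) are illustrative assumptions, not the paper's exact MEP design.

```python
import numpy as np


class DualChannelMemory:
    """Illustrative sketch of a dual-channel memory acting as a
    non-parametric classifier. Channel 0: non-sarcastic; channel 1:
    sarcastic. This is a simplified assumption, not the exact MEP."""

    def __init__(self, dim: int, capacity: int = 64):
        # One bounded bank of L2-normalized feature vectors per channel.
        self.banks = {0: [], 1: []}
        self.capacity = capacity
        self.dim = dim

    def update(self, feature: np.ndarray, label: int,
               confidence: float, threshold: float = 0.9) -> None:
        # Only confidently-predicted test samples are stored as
        # "historical knowledge" (threshold value is an assumption).
        if confidence < threshold:
            return
        bank = self.banks[label]
        bank.append(feature / np.linalg.norm(feature))
        if len(bank) > self.capacity:
            bank.pop(0)  # evict the oldest entry (FIFO; a simplification)

    def predict(self, feature: np.ndarray) -> int:
        # Non-parametric prediction: score the sample against each
        # channel by mean cosine similarity and pick the closer channel.
        f = feature / np.linalg.norm(feature)
        scores = {}
        for label, bank in self.banks.items():
            if not bank:
                scores[label] = -np.inf
            else:
                scores[label] = float(np.mean(np.stack(bank) @ f))
        return max(scores, key=scores.get)
```

In this sketch, the memory is filled on the fly during inference, so the classifier adapts to the test distribution without any extra trainable parameters.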