Recent years have witnessed the sustained evolution of misinformation aimed at manipulating public opinion. Unlike traditional rumor spreaders and fake-news editors, who mainly rely on generated or counterfeit images, text, and videos, current misinformation creators tend to use out-of-context multimedia content (e.g., mismatched images and captions) to deceive both the public and fake-news detection systems. This new type of misinformation makes not only detection but also clarification more difficult, because each individual modality is close enough to true information. To address this challenge, in this paper we explore how to achieve interpretable cross-modal de-contextualization detection that simultaneously identifies mismatched pairs and the cross-modal contradictions, which helps fact-checking websites document clarifications. The proposed model first symbolically disassembles the text-modality information into a set of fact queries based on the Abstract Meaning Representation of the caption, and then forwards the query-image pairs to a pre-trained large vision-language model to select the "evidence" that is helpful for detecting misinformation. Extensive experiments indicate that the proposed methodology provides much more interpretable predictions while maintaining the same accuracy as the state-of-the-art model on this task.
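
The two-stage idea described above (caption to fact queries, then query-image scoring) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not the paper's implementation: the AMR graph is hand-written as simple triples, the question templates in `ROLE_TEMPLATES` are hypothetical, `toy_scorer` is a stand-in for the pre-trained vision-language model, the image is reduced to a set of tags, and thresholding the scores is only one possible way of selecting evidence.

```python
from typing import Callable, Dict, List, Set, Tuple

# Simplified AMR-style edge: (head concept, role, dependent concept).
# Real AMR graphs are richer; this flat form is an illustrative assumption.
Triple = Tuple[str, str, str]

# Hypothetical templates that turn one AMR edge into one fact query.
ROLE_TEMPLATES: Dict[str, str] = {
    ":ARG0": "Is {dep} the one doing '{head}'?",
    ":ARG1": "Is {dep} the target of '{head}'?",
    ":location": "Does '{head}' take place in {dep}?",
    ":time": "Does '{head}' happen at {dep}?",
}


def triples_to_queries(triples: List[Triple]) -> List[str]:
    """Disassemble a caption graph into one fact query per supported edge."""
    queries = []
    for head, role, dep in triples:
        template = ROLE_TEMPLATES.get(role)
        if template is not None:  # edges with unsupported roles are skipped
            queries.append(template.format(head=head, dep=dep))
    return queries


def toy_scorer(query: str, image_tags: Set[str]) -> float:
    """Stand-in for a vision-language model: 1.0 if any image tag
    occurs in the query text, otherwise 0.0."""
    return 1.0 if any(tag in query.lower() for tag in image_tags) else 0.0


def detect(
    triples: List[Triple],
    image: Set[str],
    scorer: Callable[[str, Set[str]], float],
    threshold: float = 0.5,
) -> dict:
    """Score every query against the image and keep the low-scoring
    queries as evidence of a cross-modal contradiction."""
    scored = [(q, scorer(q, image)) for q in triples_to_queries(triples)]
    evidence = [(q, s) for q, s in scored if s < threshold]
    return {"out_of_context": bool(evidence), "evidence": evidence}


# Caption: "The president visits Paris", paired with an image tagged London.
caption_graph = [("visit", ":ARG0", "president"), ("visit", ":location", "paris")]
result = detect(caption_graph, {"president", "london"}, toy_scorer)
print(result)
```

In this toy setting the only query that survives as evidence is the location query, which is the kind of human-readable contradiction a fact-checker could cite in a clarification.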