Online misinformation is often multimodal in nature, i.e., it arises from misleading associations between texts and their accompanying images. To support the fact-checking process, researchers have recently been developing automatic multimodal methods that gather and analyze external information (evidence) related to the image-text pairs under examination. However, prior works assumed that all external information collected from the web is relevant. In this study, we introduce a "Relevant Evidence Detection" (RED) module to discern whether each piece of evidence is relevant to supporting or refuting the claim. Specifically, we develop the "Relevant Evidence Detection Directed Transformer" (RED-DOT) and explore multiple architectural variants (e.g., single or dual-stage) and mechanisms (e.g., "guided attention"). Extensive ablation and comparative experiments demonstrate that RED-DOT achieves significant improvements over the state-of-the-art (SotA) on the VERITE benchmark, by up to 33.7%. Furthermore, our evidence re-ranking and element-wise modality fusion enable RED-DOT to surpass the SotA on NewsCLIPings+ by up to 3% without requiring numerous pieces of evidence or multiple backbone encoders. We release our code at: https://github.com/stevejpapad/relevant-evidence-detection
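To make the two mechanisms named above more concrete, the following is a minimal sketch of evidence re-ranking and element-wise modality fusion. It is an illustration under stated assumptions, not the released RED-DOT implementation: the abstract does not specify the similarity measure or the fusion operations, so cosine similarity and the concatenation of the two embeddings with their sum, difference, and product are assumed here, and the function names `cosine`, `rerank_evidence`, and `elementwise_fusion` are hypothetical. Embeddings are plain Python lists standing in for the output of a backbone encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rerank_evidence(query_emb, evidence_embs, top_k=1):
    """Assumed re-ranking rule: keep the top_k evidence embeddings most similar to the query."""
    ranked = sorted(evidence_embs, key=lambda e: cosine(query_emb, e), reverse=True)
    return ranked[:top_k]


def elementwise_fusion(image_emb, text_emb):
    """Assumed fusion: concatenate [image; text; image+text; image-text; image*text]."""
    if len(image_emb) != len(text_emb):
        raise ValueError("image and text embeddings must have the same dimension")
    added = [i + t for i, t in zip(image_emb, text_emb)]
    subtracted = [i - t for i, t in zip(image_emb, text_emb)]
    multiplied = [i * t for i, t in zip(image_emb, text_emb)]
    return list(image_emb) + list(text_emb) + added + subtracted + multiplied


if __name__ == "__main__":
    # Toy 2-dimensional embeddings; real encoder outputs would be much larger.
    image_emb, text_emb = [1.0, 2.0], [3.0, 4.0]
    fused = elementwise_fusion(image_emb, text_emb)
    best = rerank_evidence([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]], top_k=1)
    print(len(fused), best)
```

In this sketch, fusing two d-dimensional embeddings yields a 5d-dimensional vector, and re-ranking with a small `top_k` reflects the abstract's point that the method does not need numerous pieces of evidence.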