Multimodal fake news detection is crucial for mitigating adversarial misinformation. Existing methods, which rely on static fusion or LLMs, suffer from computational redundancy and hallucination risks stemming from weak visual grounding. To address this, we propose DIVER (Dynamic Iterative Visual Evidence Reasoning), a framework built on a progressive, evidence-driven reasoning paradigm. DIVER first establishes a strong text-based baseline through language analysis, leveraging intra-modal consistency to filter unreliable or hallucinated claims. Only when textual evidence is insufficient does the framework introduce visual information, where inter-modal alignment verification adaptively determines whether deeper visual inspection is necessary. For samples exhibiting significant cross-modal semantic discrepancies, DIVER selectively invokes fine-grained visual tools (e.g., OCR and dense captioning) to extract task-relevant evidence, which is iteratively aggregated via uncertainty-aware fusion to refine multimodal reasoning. Experiments on Weibo, Weibo21, and GossipCop demonstrate that DIVER outperforms state-of-the-art baselines by an average of 2.72\%, while also improving inference efficiency, reducing latency by 4.12 s.
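To make the progressive gating concrete, the sketch below illustrates the three-stage control flow described above in Python. All component interfaces (\texttt{text\_verdict}, \texttt{alignment\_score}, \texttt{visual\_evidence}, \texttt{uncertainty\_aware\_fusion}) and thresholds are hypothetical stand-ins, not the paper's actual implementation:

\begin{verbatim}
# Hedged sketch of DIVER-style progressive, evidence-driven gating.
# Component names and thresholds are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class Verdict:
    label: str         # "real" or "fake"
    confidence: float  # in [0, 1]

TEXT_CONF_THRESH = 0.9  # assumed: accept text-only verdicts above this
ALIGN_THRESH = 0.5      # assumed: below this, invoke fine-grained tools

def text_verdict(text: str) -> Verdict:
    """Stage 1 stub: language analysis with intra-modal consistency."""
    return Verdict("real", 0.6)  # placeholder score

def alignment_score(text: str, image) -> float:
    """Stage 2 stub: inter-modal alignment (e.g., image-text similarity)."""
    return 0.4  # placeholder score

def visual_evidence(image) -> list[tuple[str, float]]:
    """Stage 3 stub: fine-grained tools (OCR, dense captioning),
    each returning an (evidence, uncertainty) pair."""
    return [("overlaid text: ...", 0.2), ("dense caption: ...", 0.5)]

def uncertainty_aware_fusion(base: Verdict, evidence) -> Verdict:
    """Iteratively aggregate evidence, weighting by inverse uncertainty."""
    conf, weight = base.confidence, 1.0
    for _, u in evidence:
        w = 1.0 / (u + 1e-6)               # low uncertainty -> high weight
        conf = (conf * weight + (1.0 - u) * w) / (weight + w)
        weight += w
    return Verdict(base.label, conf)

def diver(text: str, image) -> Verdict:
    v = text_verdict(text)
    if v.confidence >= TEXT_CONF_THRESH:   # textual evidence suffices
        return v
    if alignment_score(text, image) >= ALIGN_THRESH:
        return v                            # modalities agree; skip deep tools
    return uncertainty_aware_fusion(v, visual_evidence(image))

print(diver("breaking: ...", image=None))
\end{verbatim}

The design point the sketch captures is that the expensive visual tools sit behind two cheap gates (text confidence, then alignment), which is what yields the latency savings reported above.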