Current multimodal misinformation detection (MMD) methods typically assume a single forgery source and type per sample, an assumption that breaks down in real-world scenarios where multiple forgery sources coexist. The absence of a benchmark for mixed-source misinformation has hindered progress in this field. To address this gap, we introduce MMFakeBench, the first comprehensive benchmark for mixed-source MMD. MMFakeBench covers three critical distortion sources: textual veracity distortion, visual veracity distortion, and cross-modal consistency distortion, spanning 12 sub-categories of misinformation forgery types. We further conduct an extensive evaluation of 6 prevalent detection methods and 15 large vision-language models (LVLMs) on MMFakeBench under a zero-shot setting. The results indicate that current methods struggle in this challenging and realistic mixed-source MMD setting. In addition, we propose a unified framework that integrates the rationales, actions, and tool-use capabilities of LVLM agents, significantly improving both accuracy and generalization. We believe this study will catalyze future research on more realistic mixed-source multimodal misinformation and provide a fair evaluation of misinformation detection methods.