The rise of multimodal misinformation on social platforms poses significant challenges for individuals and societies. Because it is more credible and has a broader impact than purely textual misinformation, it is harder to detect: accurate verification requires robust reasoning across diverse media types as well as in-depth knowledge. The emergence of Large Vision Language Models (LVLMs) offers a potential solution to this problem. Leveraging their proficiency in processing visual and textual information, LVLMs show promising capabilities in recognizing complex information and exhibit strong reasoning skills. In this paper, we first investigate the potential of LVLMs for multimodal misinformation detection. We find that although LVLMs outperform LLMs, their reasoning is of limited use when supporting evidence is lacking. Based on these observations, we propose LEMMA: LVLM-Enhanced Multimodal Misinformation Detection with External Knowledge Augmentation. LEMMA leverages the intuition and reasoning capabilities of LVLMs while augmenting them with external knowledge to improve the accuracy of misinformation detection. Our method improves accuracy over the top baseline LVLM by 7% and 13% on the Twitter and Fakeddit datasets, respectively.