Large vision-language models (LVLMs) have significantly improved multimodal reasoning tasks such as visual question answering and image captioning. These models embed multimodal facts within their parameters rather than relying on external knowledge bases to store factual information explicitly. However, the content discerned by LVLMs may deviate from actual facts due to inherent bias or incorrect inference. To address this issue, we introduce MFC-Bench, a rigorous and comprehensive benchmark designed to evaluate the factual accuracy of LVLMs across three stages of verdict prediction for multimodal fact-checking (MFC): Manipulation, Out-of-Context, and Veracity Classification. Through our evaluation on MFC-Bench, we benchmarked a dozen diverse and representative LVLMs, finding that current models still fall short in multimodal fact-checking and remain insensitive to various forms of manipulated content. We hope that MFC-Bench will draw attention to trustworthy AI potentially assisted by LVLMs in the future. MFC-Bench and its accompanying resources are publicly available at https://github.com/wskbest/MFC-Bench, contributing to ongoing research in the multimodal fact-checking field.