FakeBench: Uncover the Achilles' Heels of Fake Images with Large Multimodal Models

Fake images generated by artificial intelligence (AI) models have recently become indistinguishable from real ones, posing new challenges for fake image detection models. In this context, simple binary judgments of real or fake are less convincing and credible because they lack human-understandable explanations. Large Multimodal Models (LMMs) offer a way to make the judgment process explicit, but their performance on this task remains undetermined. We therefore propose FakeBench, a first-of-its-kind benchmark towards transparent defake, consisting of fake images paired with human-language descriptions of forgery signs. FakeBench probes two open questions about LMMs: (1) can LMMs distinguish fake images generated by AI, and (2) how do LMMs distinguish fake images? Specifically, we construct the FakeClass dataset with 6k fake and real images from diverse sources, each equipped with a question-answer pair about its authenticity; these pairs are used to benchmark detection ability. To examine the reasoning and interpretation abilities of LMMs, we present the FakeClue dataset, consisting of 15k descriptions of the telltale clues that reveal the falsification of fake images. In addition, we construct FakeQA to measure LMMs' open-question answering ability on fine-grained authenticity-relevant aspects. Our experimental results show that current LMMs possess moderate identification ability, preliminary interpretation and reasoning ability, and passable open-question answering ability for image defake. FakeBench will be made publicly available soon.