With the increasing prevalence of synthetic images, accurately assessing image authenticity and localizing forgeries while maintaining human interpretability remain challenging. Existing detection models focus primarily on authenticity classification, ultimately providing only a forgery probability or a binary judgment, which offers limited explanatory insight into image authenticity. Moreover, while MLLM-based detection methods can produce more interpretable results, they still lag behind expert models in pure classification accuracy. To address this, we propose DF-LLaVA, a simple yet effective framework that unlocks the intrinsic discrimination potential of MLLMs. Our approach first extracts latent knowledge from MLLMs and then injects it into training via prompts. This framework allows LLaVA to achieve detection accuracy exceeding that of expert models while retaining the interpretability offered by MLLMs. Extensive experiments confirm the superiority of DF-LLaVA, achieving both high accuracy and explainability in synthetic image detection. Code is available at: https://github.com/Eliot-Shen/DF-LLaVA.
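To make the "extract latent knowledge, then inject it via prompts" idea concrete, the toy sketch below serializes a scalar score probed from visual features into the instruction text used during fine-tuning. All names here (`extract_latent_score`, `build_training_prompt`) are illustrative assumptions, not the actual DF-LLaVA API, and the score computation is a stand-in for the paper's probing step.

```python
# Hypothetical sketch of prompt-based knowledge injection, assuming a scalar
# "synthetic likelihood" probed from the MLLM's visual features. Not the
# real DF-LLaVA implementation.

def extract_latent_score(image_features):
    """Toy stand-in for latent-knowledge probing: mean feature activation
    clipped to [0, 1] as a pseudo synthetic-likelihood score."""
    s = sum(image_features) / len(image_features)
    return max(0.0, min(1.0, s))

def build_training_prompt(question, latent_score):
    """Inject the extracted score into the text prompt as an auxiliary hint."""
    hint = f"[latent synthetic-likelihood: {latent_score:.2f}]"
    return f"{hint} {question}"

features = [0.2, 0.9, 0.4]  # pretend pooled visual features
score = extract_latent_score(features)
prompt = build_training_prompt("Is this image real or synthetic?", score)
print(prompt)
```

The design choice being sketched is that the injected hint lives purely in the prompt, so the base MLLM architecture is unchanged and the model can still generate free-form, interpretable explanations alongside its verdict.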