Existing Multimodal Large Language Models (MLLMs) for image forgery detection and localization predominantly operate under a text-centric Chain-of-Thought (CoT) paradigm. However, forcing these models to textually characterize imperceptible low-level tampering traces inevitably leads to hallucinations, as the linguistic modality is insufficient to capture such fine-grained pixel-level inconsistencies. To overcome this, we propose ForgeryVCR, a framework that incorporates a forensic toolbox to materialize imperceptible traces into explicit visual intermediates via Visual-Centric Reasoning. To enable efficient tool utilization, we introduce a Strategic Tool Learning post-training paradigm, encompassing gain-driven trajectory construction for Supervised Fine-Tuning (SFT) and subsequent Reinforcement Learning (RL) optimization guided by a tool utility reward. This paradigm empowers the MLLM to act as a proactive decision-maker, learning to spontaneously invoke multi-view reasoning paths, including local zoom-in for fine-grained inspection and the analysis of invisible inconsistencies in compression history, noise residuals, and the frequency domain. Extensive experiments show that ForgeryVCR achieves state-of-the-art (SOTA) performance on both detection and localization tasks, demonstrating superior generalization and robustness with minimal tool redundancy. The project page is available at https://youqiwong.github.io/projects/ForgeryVCR/.