Multimodal large language models (M-LLMs) have unlocked new possibilities for various multimodal tasks. However, their potential in image manipulation detection (IMD) remains unexplored. When applied directly to the IMD task, M-LLMs often produce reasoning texts that suffer from hallucinations and overthinking. To address this, we propose ForgerySleuth, which leverages M-LLMs to perform comprehensive clue fusion and generate segmentation outputs indicating the specific regions that have been tampered with. Moreover, we construct the ForgeryAnalysis dataset through the Chain-of-Clues prompt, which augments the IMD task with analysis and reasoning text. We also introduce a data engine to build a larger-scale dataset for the pre-training phase. Our extensive experiments demonstrate the effectiveness of ForgeryAnalysis and show that ForgerySleuth significantly outperforms existing methods in generalization, robustness, and explainability.