Recent advances in image generation have led to the widespread availability of highly realistic synthetic media, making reliable deepfake detection increasingly difficult. A key challenge is generalization: detectors trained on a narrow class of generators often fail when confronted with unseen models. In this work, we address the need for generalizable detection by leveraging large vision-language models, specifically CLIP, to identify synthetic content across diverse generative techniques. First, we introduce Diff-Gen, a large-scale benchmark dataset of 100k diffusion-generated fakes that exhibit a broader range of spectral artifacts than traditional GAN datasets. Models trained on Diff-Gen demonstrate stronger cross-domain generalization, particularly on previously unseen image generators. Second, we propose AdaptPrompt, a parameter-efficient transfer learning framework that jointly learns task-specific textual prompts and visual adapters while keeping the CLIP backbone frozen. Through a layer ablation, we further show that pruning the final transformer block of the vision encoder improves the retention of high-frequency generative artifacts, significantly boosting detection accuracy. Our evaluation spans 25 challenging test sets covering synthetic content from GANs, diffusion models, and commercial tools, establishing a new state of the art in both standard and cross-domain scenarios. Finally, we demonstrate the framework's versatility through few-shot generalization (using as few as 320 images) and source attribution, enabling precise identification of generator architectures in closed-set settings.
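The abstract does not specify implementation details, so the following is a minimal PyTorch sketch of the described setup: a frozen CLIP backbone with its final vision transformer block pruned, a lightweight residual visual adapter, and trainable class embeddings in the joint space as a stand-in for full textual prompt tuning. The checkpoint name, adapter dimensions, prompt wording, and training hyperparameters are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch under the assumptions stated above; not the authors' code.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPTokenizer

ckpt = "openai/clip-vit-base-patch32"            # assumed checkpoint
model = CLIPModel.from_pretrained(ckpt)
tokenizer = CLIPTokenizer.from_pretrained(ckpt)

for p in model.parameters():                     # keep the CLIP backbone frozen
    p.requires_grad = False

# Layer ablation: drop the final transformer block of the vision encoder.
model.vision_model.encoder.layers = model.vision_model.encoder.layers[:-1]

class VisualAdapter(nn.Module):
    """Residual bottleneck adapter over frozen CLIP image features."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

adapter = VisualAdapter(model.config.projection_dim)

# Stand-in for learned textual prompts: class embeddings in the joint
# space, initialized from hand-written prompts and then trained.
with torch.no_grad():
    text = tokenizer(["a real photograph", "an AI-generated image"],
                     padding=True, return_tensors="pt")
    init = model.get_text_features(**text)
class_embeds = nn.Parameter(init.clone())

def logits(pixel_values: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity logits over {real, fake} for a batch of images."""
    with torch.no_grad():                        # backbone stays frozen
        feats = model.get_image_features(pixel_values=pixel_values)
    feats = adapter(feats)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    protos = class_embeds / class_embeds.norm(dim=-1, keepdim=True)
    return model.logit_scale.exp() * feats @ protos.t()

# Only the adapter and the class embeddings receive gradients.
optim = torch.optim.AdamW([*adapter.parameters(), class_embeds], lr=1e-3)
out = logits(torch.randn(2, 3, 224, 224))        # dummy batch; shape (2, 2)
```

Under this sketch, the trainable parameter count is on the order of the adapter plus two embedding vectors, consistent with the parameter-efficient framing in the abstract; pruning the last encoder block is a one-line edit because the downstream projection operates on whatever the remaining layers emit.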