Recent advances in Generative Adversarial Networks (GANs) and the emergence of diffusion models have made highly realistic synthetic content easy to produce and widely accessible. As a result, there is a pressing need for effective general-purpose detection mechanisms to mitigate the risks posed by deepfakes. In this paper, we explore the effectiveness of pre-trained vision-language models (VLMs) paired with recent adaptation methods for universal deepfake detection. Following previous studies in this domain, we use only a single dataset (ProGAN) to adapt CLIP for deepfake detection. However, in contrast to prior research, which relies solely on the visual part of CLIP while ignoring its textual component, our analysis reveals that retaining the text part is crucial. Consequently, the simple and lightweight Prompt Tuning-based adaptation strategy that we employ outperforms the previous SOTA approach by 5.01% mAP and 6.61% accuracy while using less than one third of the training data (200k images compared to 720k). To assess the real-world applicability of our proposed models, we conduct a comprehensive evaluation across various scenarios, testing on images sourced from 21 distinct datasets, including those generated by GAN-based, diffusion-based, and commercial tools.