Recent advances in diffusion models have enabled the generation of realistic deepfakes from natural-language text prompts. While these models offer numerous benefits across various sectors, they have also raised concerns about the potential misuse of fake images and put new pressure on fake image detection. In this work, we pioneer a systematic study of the detection of deepfakes generated by state-of-the-art diffusion models. First, we conduct a comprehensive analysis of the performance of contrastive and classification-based visual features, extracted respectively from CLIP-based models and from ResNet- or ViT-based architectures trained on image classification datasets. Our results demonstrate that fake images share common low-level cues, which render them easily recognizable. Further, we devise a multimodal setting in which fake images are synthesized from different textual captions, which are used as seeds for a generator. Under this setting, we quantify the performance of fake detection strategies and introduce a contrastive-based disentangling method that lets us analyze the role of the semantics of textual descriptions and of low-level perceptual cues. Finally, we release a new dataset, called COCOFake, containing about 1.2M images generated from the original COCO image-caption pairs using two recent text-to-image diffusion models, namely Stable Diffusion v1.4 and v2.0.
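As a rough illustration of the feature analysis described above, the following minimal sketch trains a linear probe (logistic regression) on frozen feature vectors to separate real from fake images. The abstract does not specify the classifier placed on top of the extracted features, so the linear probe is an assumption, not the authors' method. The features here are synthetic placeholders rather than CLIP, ResNet, or ViT embeddings, and the helper names (`make_placeholder_features`, `train_linear_probe`, `standardize`, `predict`) and the single shifted dimension standing in for a shared low-level cue are hypothetical.

```python
import numpy as np


def make_placeholder_features(n, d, shift, rng):
    """Build synthetic stand-ins for frozen backbone features (assumption:
    not real embeddings). Fake samples are shifted along one dimension to
    mimic a low-level cue shared by all generated images."""
    real = rng.normal(0.0, 1.0, size=(n, d))
    fake = rng.normal(0.0, 1.0, size=(n, d))
    fake[:, 0] += shift
    x = np.vstack([real, fake])
    y = np.concatenate([np.zeros(n), np.ones(n)])  # 0 = real, 1 = fake
    return x, y


def standardize(train, test):
    """Normalize both splits with statistics computed on the training split."""
    mean = train.mean(axis=0)
    std = train.std(axis=0) + 1e-8
    return (train - mean) / std, (test - mean) / std


def train_linear_probe(features, labels, lr=0.1, epochs=200):
    """Fit logistic regression with full-batch gradient descent."""
    n, d = features.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        probs = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        grad = probs - labels  # gradient of the log-loss w.r.t. the logits
        w -= lr * (features.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def predict(features, w, b):
    """Return 1 (fake) where the logit is positive, else 0 (real)."""
    return (features @ w + b > 0).astype(int)


rng = np.random.default_rng(0)
x_train, y_train = make_placeholder_features(500, 16, 4.0, rng)
x_test, y_test = make_placeholder_features(200, 16, 4.0, rng)
x_train, x_test = standardize(x_train, x_test)

w, b = train_linear_probe(x_train, y_train)
accuracy = (predict(x_test, w, b) == y_test).mean()
print(f"held-out accuracy on placeholder features: {accuracy:.3f}")
```

In an actual experiment, the placeholder arrays would be replaced by features extracted once from a frozen backbone for each real and generated image; keeping the backbone frozen means that the probe's accuracy reflects what the features already encode rather than what fine-tuning could learn.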