The aim of this work is to explore the potential of pre-trained vision-language models (VLMs) for universal detection of AI-generated images. We develop a lightweight detection strategy based on CLIP features and study its performance in a wide variety of challenging scenarios. We find that, contrary to common belief, a large domain-specific training dataset is neither necessary nor beneficial. Instead, using only a handful of example images from a single generative model, a CLIP-based detector exhibits surprising generalization ability and high robustness across different architectures, including recent commercial tools such as DALL-E 3, Midjourney v5, and Firefly. We match the state-of-the-art (SoTA) on in-distribution data and significantly improve upon it in terms of generalization to out-of-distribution data (+6% AUC) and robustness to impaired/laundered data (+13%). Our project is available at https://grip-unina.github.io/ClipBased-SyntheticImageDetection/
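The detector described above is a lightweight classifier on top of frozen CLIP image embeddings, trained from only a handful of examples. The sketch below illustrates the general idea with a simple logistic-regression linear probe over precomputed feature vectors; it is not the authors' implementation, and the random vectors merely stand in for real CLIP embeddings (which would come from, e.g., a ViT-L/14 backbone).

```python
import numpy as np

def train_linear_probe(feats, labels, lr=0.1, epochs=500):
    """Fit a logistic-regression probe on frozen image embeddings.

    feats:  (n, d) array of image features (real CLIP embeddings in practice)
    labels: (n,) array, 1 = AI-generated, 0 = real
    """
    n, d = feats.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        z = feats @ w + b
        p = 1.0 / (1.0 + np.exp(-z))          # sigmoid
        grad_w = feats.T @ (p - labels) / n    # gradient of the logistic loss
        grad_b = np.mean(p - labels)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def score(feats, w, b):
    """Decision score: higher means more likely AI-generated."""
    return feats @ w + b

# Toy few-shot demo: 10 "real" and 10 "generated" stand-in embeddings.
rng = np.random.default_rng(0)
d = 64
real = rng.normal(0.0, 1.0, size=(10, d))
fake = rng.normal(0.5, 1.0, size=(10, d))   # shifted cluster plays "generated"
X = np.vstack([real, fake])
y = np.array([0] * 10 + [1] * 10)

w, b = train_linear_probe(X, y)
acc = np.mean((score(X, w, b) > 0) == y)
```

Because the backbone stays frozen and only a d-dimensional linear probe is learned, a few tens of examples suffice, which is what makes the few-shot setting in the abstract feasible.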