Rich feature representations derived from CLIP-ViT have been widely utilized in AI-generated image detection. While most existing methods primarily leverage features from the final layer, we systematically analyze the contributions of layer-wise features to this task. Our study reveals that earlier layers provide more localized and generalizable features, often surpassing the performance of final-layer features in detection tasks. Moreover, we find that different layers capture distinct aspects of the data, each contributing uniquely to AI-generated image detection. Motivated by these findings, we introduce a novel adaptive method, termed MoLD, which dynamically integrates features from multiple ViT layers using a gating-based mechanism. Extensive experiments on both GAN- and diffusion-generated images demonstrate that MoLD significantly improves detection performance, enhances generalization across diverse generative models, and exhibits robustness in real-world scenarios. Finally, we illustrate the scalability and versatility of our approach by successfully applying it to other pre-trained ViTs, such as DINOv2.
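To make the idea of gating-based layer fusion concrete, the following is a minimal sketch, not the MoLD implementation. The abstract does not specify the form of the gate, so the sketch assumes one feature vector per ViT layer (for example, a class token), a hypothetical linear scoring function per layer, and a softmax over the layer scores. The names `fuse_layers`, `gate_w`, and `gate_b` are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_feats, gate_w, gate_b):
    """Fuse per-layer feature vectors with softmax gate weights.

    layer_feats: list of L feature vectors, each of length D
                 (assumed: one vector per ViT layer).
    gate_w:      list of L weight vectors of length D (hypothetical gate parameters).
    gate_b:      list of L scalar biases (hypothetical gate parameters).
    Returns (fused, weights): the weighted sum of layer features and
    the gate weights that produced it.
    """
    # One scalar score per layer: a linear function of that layer's features.
    scores = [
        sum(w * x for w, x in zip(w_l, f_l)) + b_l
        for f_l, w_l, b_l in zip(layer_feats, gate_w, gate_b)
    ]
    # Normalize the scores so the layer weights are positive and sum to 1.
    weights = softmax(scores)
    dim = len(layer_feats[0])
    # Weighted sum across layers, computed per feature dimension.
    fused = [
        sum(weights[l] * layer_feats[l][d] for l in range(len(layer_feats)))
        for d in range(dim)
    ]
    return fused, weights


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy features for 3 layers
    zeros_w = [[0.0, 0.0] for _ in feats]
    fused, weights = fuse_layers(feats, zeros_w, [0.0, 0.0, 0.0])
    print(weights, fused)
```

Because the weights depend on the input features, each image can emphasize different layers. With all gate parameters set to zero, the sketch reduces to a plain average over layers. A fused vector of this kind would then be passed to a real-versus-generated classifier head, which is omitted here.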