As AI-generated images proliferate across digital platforms, reliable detection methods have become critical for combating misinformation and maintaining content authenticity. While numerous deepfake detection methods have been proposed, existing benchmarks predominantly evaluate fine-tuned models, leaving a critical gap in our understanding of out-of-the-box performance -- the most common deployment scenario for practitioners. We present the first comprehensive zero-shot evaluation of 16 state-of-the-art detection methods, comprising 23 pretrained detector variants (several detectors have multiple released versions), across 12 diverse datasets totaling 2.6~million image samples that span 291 unique generators, including modern diffusion models. Our systematic analysis reveals striking findings: (1)~no universal winner exists, with detector rankings exhibiting substantial instability (Spearman~$\rho$: 0.01--0.87 across dataset pairs); (2)~a 37.5~percentage-point performance gap separates the best detector (75.0\% mean accuracy) from the worst (37.5\%); (3)~training-data alignment critically impacts generalization, causing up to 20--60\% performance variance within architecturally identical detector families; (4)~modern commercial generators (Flux~Dev, Firefly~v4, Midjourney~v7) defeat most detectors, which achieve only 18--30\% average accuracy on them; and (5)~we identify three systematic failure patterns that undermine cross-dataset generalization. Statistical analysis confirms significant performance differences between detectors (Friedman test: $\chi^2 = 121.01$, $p < 10^{-16}$, Kendall's~$W = 0.524$). Our findings challenge the ``one-size-fits-all'' detector paradigm and provide actionable deployment guidelines, demonstrating that practitioners must select detectors based on their specific threat landscape rather than relying on published benchmark performance.
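The rank-based statistics quoted above (Spearman~$\rho$ between dataset pairs, the Friedman $\chi^2$ over detector rankings, and Kendall's~$W$ as an agreement coefficient) can be sketched as follows. This is an illustrative example on synthetic accuracies, not the paper's data; the matrix shape, seed, and the no-ties closed-form formulas are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(42)
k, N = 23, 12  # k pretrained detector variants, N evaluation datasets
# acc[i, j]: synthetic stand-in for detector i's accuracy on dataset j
acc = rng.uniform(0.3, 0.8, size=(k, N))

# Rank the k detectors within each dataset (1 = lowest accuracy).
# Continuous random draws make ties negligible, so no tie correction.
ranks = np.empty_like(acc)
for j in range(N):
    order = acc[:, j].argsort()
    ranks[order, j] = np.arange(1, k + 1)

# Friedman statistic (no-ties form) over rank sums per detector.
R = ranks.sum(axis=1)
chi2 = 12.0 / (N * k * (k + 1)) * (R ** 2).sum() - 3 * N * (k + 1)

# Kendall's W: normalized agreement of the N dataset rankings (0..1).
W = chi2 / (N * (k - 1))

# Spearman rho between the rankings induced by two datasets
# (closed form for untied ranks: rho = 1 - 6*sum(d^2)/(k*(k^2-1))).
d = ranks[:, 0] - ranks[:, 1]
rho = 1 - 6 * (d ** 2).sum() / (k * (k ** 2 - 1))
```

A low $W$ (or unstable pairwise $\rho$ values) indicates that datasets disagree about which detector is best, which is exactly the instability the abstract reports.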