Context-based detection methods such as DetectGPT achieve strong generalization in identifying AI-generated text by evaluating content compatibility with a model's learned distribution. In contrast, existing image detectors rely on discriminative features from pretrained backbones such as CLIP, which implicitly capture generator-specific artifacts. However, as modern generative models rapidly advance in visual fidelity, the artifacts these detectors depend on are becoming increasingly subtle or absent, undermining their reliability. A Masked AutoEncoder (MAE) is inherently trained to reconstruct masked patches from visible context, naturally modeling patch-level contextual plausibility akin to conditional probability estimation, while its encoder also serves as a powerful semantic feature extractor. We propose CINEMAE, a novel architecture that exploits both capabilities of MAE for AI-generated image detection: we derive per-patch anomaly signals from the reconstruction mechanism and extract global semantic features from the encoder, fusing context-based and feature-based cues for robust detection. CINEMAE achieves highly competitive mean accuracies of 96.63\% on GenImage and 93.96\% on AIGCDetectBenchmark, and maintains over 93\% accuracy even under JPEG compression at QF=50.
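The two cues the abstract names can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `reconstruct` callable stands in for an MAE decoder, the masked-patch MSE stands in for the per-patch anomaly signal, and plain concatenation stands in for whatever fusion module CINEMAE actually uses (all three are assumptions for illustration).

```python
import numpy as np

def patch_anomaly_scores(patches, reconstruct, mask):
    """Context-based cue: per-patch reconstruction error on masked patches.
    `reconstruct` is a stand-in for an MAE decoder (hypothetical stub)."""
    recon = reconstruct(patches, mask)
    err = np.mean((recon - patches) ** 2, axis=-1)  # MSE per patch
    return err * mask                               # score masked positions only

def fuse_cues(anomaly_scores, global_feature):
    """Feature fusion stand-in: concatenate the per-patch anomaly signal
    with a global semantic feature vector from the encoder."""
    return np.concatenate([anomaly_scores, global_feature])

# Toy demo: an identity "decoder" plus noise, 16 patches of dimension 64.
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 64))
mask = (np.arange(16) % 2).astype(float)            # mask every other patch
noisy_decoder = lambda p, m: p + 0.1 * rng.normal(size=p.shape)

scores = patch_anomaly_scores(patches, noisy_decoder, mask)
fused = fuse_cues(scores, rng.normal(size=128))     # 16 + 128 = 144 dims
```

In this sketch, visible (unmasked) patches get a score of zero, so a downstream classifier sees anomaly evidence only where the model had to predict content from context, which is the conditional-plausibility signal the abstract describes.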