Vision Language Models (VLMs) are increasingly used for detecting AI-generated images (AIGI). However, converting VLMs into reliable detectors is resource-intensive, and the resulting models often suffer from hallucination and poor generalization. To investigate the root cause, we conduct an empirical analysis and identify two consistent behaviors. First, fine-tuning VLMs with semantic supervision improves semantic discrimination and generalizes well to unseen data. Second, fine-tuning VLMs with pixel-artifact supervision leads to weak generalization. These findings reveal a fundamental task-model misalignment: VLMs are optimized for high-level semantic reasoning and lack an inductive bias toward low-level pixel artifacts, whereas conventional vision models effectively capture pixel-level artifacts but are less sensitive to semantic inconsistencies. This indicates that different models are naturally suited to different subtasks. Based on this insight, we formulate AIGI detection as two orthogonal subtasks: semantic consistency checking and pixel-artifact detection. Neglecting either subtask leads to systematic detection failures. We further propose the Task-Model Alignment principle and instantiate it in a two-branch detector, AlignGemini, which combines a VLM trained with pure semantic supervision and a vision model trained with pure pixel-artifact supervision. By enforcing clear specialization, each branch captures complementary cues. Experiments on in-the-wild benchmarks show that AlignGemini improves average accuracy by 9.5 percent while using simplified training data. These results demonstrate that task-model alignment is an effective principle for generalizable AIGI detection.