Vision-language models (VLMs) are increasingly used to detect whether AI-generated images contain visible artifacts, yet their ability to analyze such artifacts remains poorly understood. A correct image-level decision can still hide important failures: a model may correctly flag an artifact while relying on the wrong visual cue, selecting the wrong region, or describing a defect that the image does not support. To evaluate these behaviors directly, we introduce SalArt-VQA, a diagnostic benchmark for fine-grained SALient ARTifact understanding in AI-generated images. SalArt-VQA contains 950 images and 3,681 human-authored multiple-choice questions spanning artifact images, matched real reference images, and paired generated reference images. Four aligned question types evaluate presence detection, semantic localization, spatial grounding, and evidence-grounded defect identification, while the reference splits test calibration and abstention when the annotated defect is absent. Across 20 VLMs, SalArt-VQA reveals failures that image-level detection accuracy hides: the strongest model reaches 99.37% detection recall on artifact images but answers all four artifact-side questions correctly on only 53.26% of images. Comparing artifact images with artifact-free references reveals a sensitivity-calibration tradeoff: sensitive models often make unsupported artifact claims, while conservative models avoid false alarms largely by missing real artifacts. These results show that high artifact detection accuracy alone does not imply grounded artifact understanding. SalArt-VQA exposes these hidden failure modes and provides a fine-grained evaluation of whether VLM artifact claims are supported by local visual evidence.
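The gap between image-level detection recall and the rate at which all four artifact-side questions are answered correctly can be illustrated with a minimal sketch. The record layout, the question-type keys, and the toy data below are assumptions made for illustration only; they are not the benchmark's actual schema, scoring script, or results.

```python
from dataclasses import dataclass

# Hypothetical keys for the four aligned question types named in the abstract.
QUESTION_TYPES = ("presence", "localization", "grounding", "defect")


@dataclass
class ImageResult:
    """Per-image correctness flags for one model (assumed layout)."""
    image_id: str
    correct: dict  # question type -> True if the model answered correctly


def detection_recall(results):
    """Fraction of artifact images whose presence question was answered correctly."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(r.correct["presence"] for r in results) / len(results)


def all_correct_rate(results):
    """Fraction of artifact images with all four question types answered correctly."""
    if not results:
        raise ValueError("results must not be empty")
    hits = sum(all(r.correct[q] for q in QUESTION_TYPES) for r in results)
    return hits / len(results)


# Toy data, invented for illustration: every artifact is detected,
# but only half of the images are fully understood.
toy = [
    ImageResult("a", dict(presence=True, localization=True, grounding=True, defect=True)),
    ImageResult("b", dict(presence=True, localization=True, grounding=False, defect=True)),
    ImageResult("c", dict(presence=True, localization=False, grounding=True, defect=False)),
    ImageResult("d", dict(presence=True, localization=True, grounding=True, defect=True)),
]

recall = detection_recall(toy)
joint = all_correct_rate(toy)
print(f"detection recall: {recall:.2%}, all-four-correct rate: {joint:.2%}")
```

Because the joint rate requires every question on an image to be correct, it can never exceed detection recall, which is why a model can score near-perfect recall while its all-four-correct rate stays far lower.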