Automatic metrics are now central to evaluating text-to-image models, often substituting for human judgment in benchmarking and large-scale filtering. However, it remains unclear whether these metrics truly prioritize semantic correctness or instead favor visually and socially prototypical images learned from biased data distributions. We identify and study \emph{prototypicality bias} as a systematic failure mode in multimodal evaluation. We introduce \textsc{\textbf{ProtoBias}} (\textit{\textbf{Proto}typical \textbf{Bias}}), a controlled contrastive benchmark spanning Animal, Object, and Demography categories, in which semantically correct but non-prototypical images are paired with subtly incorrect yet prototypical adversarial counterparts. This setup enables a directional test of whether a metric follows textual semantics or defaults to prototypes. Our results show that widely used metrics, including CLIPScore, PickScore, and VQA-based scores, frequently misrank these pairs, while even LLM-as-Judge systems exhibit uneven robustness in socially grounded cases. Human evaluators, in contrast, consistently favor semantic correctness with larger decision margins. Motivated by these findings, we propose \textbf{\textsc{ProtoScore}}, a robust 7B-parameter metric that substantially reduces failure rates and suppresses misranking while running orders of magnitude faster than GPT-5 at inference, approaching the robustness of much larger closed-source judges.