Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics

Automatic metrics are now central to evaluating text-to-image models, often substituting for human judgment in benchmarking and large-scale filtering. However, it remains unclear whether these metrics truly prioritize semantic correctness or instead favor visually and socially prototypical images learned from biased data distributions. We identify and study prototypicality bias as a systematic failure mode in multimodal evaluation. We introduce ProtoBias (Prototypical Bias), a controlled contrastive benchmark spanning Animals, Objects, and Demography images, in which semantically correct but non-prototypical images are paired with subtly incorrect yet prototypical adversarial counterparts. This setup enables a directional evaluation of whether metrics follow textual semantics or default to prototypes. Our results show that widely used metrics, including CLIPScore, PickScore, and VQA-based scores, frequently misrank these pairs, and that even LLM-as-Judge systems exhibit uneven robustness in socially grounded cases. Human evaluations consistently favor semantic correctness with larger decision margins. Motivated by these findings, we propose ProtoScore, a robust 7B-parameter metric that substantially reduces failure rates and suppresses misranking while running orders of magnitude faster than GPT-5 at inference time and approaching the robustness of much larger closed-source judges.
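The directional evaluation described above can be illustrated with a minimal sketch. It is not the benchmark's actual code: the names `ContrastivePair`, `Metric`, and `misranking_rate` are hypothetical, images are represented by plain string identifiers, and counting ties as failures is an assumption. A metric misranks a pair when it does not score the semantically correct image strictly above the prototypical adversarial one.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class ContrastivePair:
    """One benchmark item (hypothetical structure, for illustration only)."""
    prompt: str
    correct_image: str      # semantically correct but non-prototypical
    adversarial_image: str  # subtly incorrect but prototypical


# Any text-image scorer: (prompt, image) -> score, higher means better match.
Metric = Callable[[str, str], float]


def misranking_rate(pairs: Iterable[ContrastivePair], metric: Metric) -> Tuple[float, float]:
    """Return (failure rate, mean decision margin) over the pairs.

    The margin is score(correct) - score(adversarial). A pair counts as a
    failure when the margin is <= 0 (ties treated as failures: an assumption).
    """
    margins = [
        metric(p.prompt, p.correct_image) - metric(p.prompt, p.adversarial_image)
        for p in pairs
    ]
    if not margins:
        raise ValueError("at least one pair is required")
    failures = sum(1 for m in margins if m <= 0)
    return failures / len(margins), sum(margins) / len(margins)


# Toy usage with made-up scores standing in for a real metric.
toy_scores = {
    "img_a_correct": 0.75, "img_a_proto": 0.5,
    "img_b_correct": 0.25, "img_b_proto": 0.75,
}
toy_pairs = [
    ContrastivePair("prompt a", "img_a_correct", "img_a_proto"),
    ContrastivePair("prompt b", "img_b_correct", "img_b_proto"),
]
rate, mean_margin = misranking_rate(toy_pairs, lambda _prompt, img: toy_scores[img])
```

Reporting the mean margin alongside the failure rate is one way to compare how decisively a metric or a human rater separates the two images, in the spirit of the decision margins mentioned above.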
