Empirical evaluation serves as the primary compass guiding research progress in foundation models. Despite a large body of work focused on training frontier vision-language models (VLMs), approaches to their evaluation remain nascent. To guide their maturation, we propose three desiderata that evaluations should satisfy: (1) faithfulness to the modality and application, (2) discriminability between models of varying quality, and (3) efficiency in compute. Through this lens, we identify critical failure modes that violate faithfulness and discriminability, misrepresenting model capabilities: (i) multiple-choice formats reward guessing, poorly reflect downstream use cases, and saturate early as models improve; (ii) blindly solvable questions, which can be answered without images, constitute up to 70% of some evaluations; and (iii) mislabeled or ambiguous samples compromise up to 42% of examples in certain datasets. Regarding efficiency, the computational burden of evaluating frontier models has become prohibitive: by some accounts, nearly 20% of development compute is devoted to evaluation alone. Rather than discarding existing benchmarks, we curate them via transformation and filtering to maximize fidelity and discriminability. We find that converting multiple-choice questions to generative tasks reveals sharp capability drops of up to 35%. In addition, filtering blindly solvable and mislabeled samples improves discriminative power while simultaneously reducing computational cost. We release DatBench-Full, a cleaned evaluation suite of 33 datasets spanning nine VLM capabilities, and DatBench, a discriminative subset that achieves 13x average speedup (up to 50x) while closely matching the discriminative power of the original datasets. Our work outlines a path toward evaluation practices that are both rigorous and sustainable as VLMs continue to scale.