DatBench: Discriminative, Faithful, and Efficient VLM Evaluations

Siddharth Joshi, Haoli Yin, Rishabh Adiga, Ricardo Monti, Aldo Carranza, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Fan Pan, Haakon Mongstad, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Kaleigh Mentzer, Luke Merrick, Parth Doshi, Paul Burstein, Pratyush Maini, Scott Loftin, Spandan Das, Tony Jiang, Vineeth Dorna, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt

Empirical evaluation serves as the primary compass guiding research progress in foundation models. Despite a large body of work focused on training frontier vision-language models (VLMs), approaches to their evaluation remain nascent. To guide their maturation, we propose three desiderata that evaluations should satisfy: (1) faithfulness to the modality and application, (2) discriminability between models of varying quality, and (3) efficiency in compute. Through this lens, we identify critical failure modes that violate faithfulness and discriminability, misrepresenting model capabilities: (i) multiple-choice formats reward guessing, poorly reflect downstream use cases, and saturate early as models improve; (ii) blindly solvable questions, which can be answered without images, constitute up to 70% of some evaluations; and (iii) mislabeled or ambiguous samples compromise up to 42% of examples in certain datasets. Regarding efficiency, the computational burden of evaluating frontier models has become prohibitive: by some accounts, nearly 20% of development compute is devoted to evaluation alone. Rather than discarding existing benchmarks, we curate them via transformation and filtering to maximize fidelity and discriminability. We find that converting multiple-choice questions to generative tasks reveals sharp capability drops of up to 35%. In addition, filtering blindly solvable and mislabeled samples improves discriminative power while simultaneously reducing computational cost. We release DatBench-Full, a cleaned evaluation suite of 33 datasets spanning nine VLM capabilities, and DatBench, a discriminative subset that achieves 13x average speedup (up to 50x) while closely matching the discriminative power of the original datasets. Our work outlines a path toward evaluation practices that are both rigorous and sustainable as VLMs continue to scale.
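The idea of filtering blindly solvable questions can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Sample` record, its `blind_correct` field (whether each text-only attempt, made with the image withheld, was correct), and the 0.5 threshold are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str
    # Hypothetical field: one bool per text-only attempt (image withheld),
    # True if that attempt produced the reference answer.
    blind_correct: list[bool]


def blind_solve_rate(sample: Sample) -> float:
    """Fraction of text-only attempts that answered correctly."""
    if not sample.blind_correct:
        return 0.0
    return sum(sample.blind_correct) / len(sample.blind_correct)


def filter_blindly_solvable(samples: list[Sample], max_rate: float = 0.5) -> list[Sample]:
    """Keep samples whose blind solve rate is at most max_rate.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return [s for s in samples if blind_solve_rate(s) <= max_rate]


samples = [
    Sample("q1", [True, True, True, False]),     # mostly solvable without the image
    Sample("q2", [False, False, False, False]),  # never solved without the image
    Sample("q3", [True, False, False, False]),   # occasionally guessed
]
kept = filter_blindly_solvable(samples, max_rate=0.5)
kept_ids = [s.sample_id for s in kept]
```

In this toy run, `q1` is dropped because three of its four text-only attempts succeed, while `q2` and `q3` are retained. A stricter or looser threshold trades benchmark size against how strongly the remaining questions depend on the image.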

