Influential benchmarks incentivize competing model developers to strategically allocate post-training resources toward improvements on the leaderboard, a phenomenon dubbed benchmaxxing or training on the test task. In this work, we initiate a principled study of the incentive structure that benchmarks induce. We model benchmarking as a Stackelberg game between a benchmark designer, who chooses an evaluation protocol, and multiple model developers, who compete simultaneously in a subgame determined by the designer's choice. Each competitor has a model of unknown latent quality and can inflate its observed score by allocating resources to benchmark-specific improvements. First, we prove that current benchmarks induce games for which no Nash equilibrium among the model developers exists. This result suggests one explanation for why current practice leads to misaligned incentives, prompting model developers to strategize in opaque ways. However, we prove that under mild conditions, a recently proposed evaluation protocol, called tune-before-test, induces a benchmark with a unique Nash equilibrium that ranks models by latent quality. This positive result demonstrates that benchmarks need not set bad incentives, even if current evaluations do.
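As a rough illustration of the kind of subgame described here (with functional forms that are our own assumptions, not the paper's actual construction), suppose developer $i$ holds a model of latent quality $q_i$ and chooses benchmark-specific effort $e_i \ge 0$ at convex cost $c(e_i)$, and the benchmark reports the inflated score
\[
  s_i \;=\; q_i + e_i .
\]
Under a stylized winner-take-all reward, developer $i$'s payoff in the simultaneous subgame is
\[
  u_i(e_i, e_{-i}) \;=\; \mathbf{1}\!\left[\, s_i > \max_{j \neq i} s_j \,\right] \;-\; c(e_i),
\]
whose discontinuity at score ties hints at why an equilibrium can fail to exist. In this stylized picture, a protocol such as tune-before-test would correspond to the designer replacing the score map $(q_i, e_i) \mapsto s_i$ with one invariant to $e_i$, making $e_i = 0$ the unique equilibrium play and ordering the leaderboard by $q_i$.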