AI evaluations are shifting toward harder tasks that benefit from longer trajectories involving tool use and iterative problem solving. As a result, performance is increasingly sensitive to the amount and allocation of compute available at test time ("inference compute"). Yet many evaluations still report performance at a single restrictive budget, meaning that low scores may reflect the evaluation setup rather than the model's underlying capability. To test this, we evaluate up to 12 frontier language models on seven challenging benchmarks spanning software engineering, mathematics, medicine, and cybersecurity. We use a controlled setup combining three simple inference-scaling interventions: larger token budgets, context compaction, and repeated submission attempts, guided either by the model itself or by minimal correctness feedback. We find three main results. First, larger token budgets substantially improve performance on benchmarks across multiple domains, including cybersecurity, FrontierMath, Humanity's Last Exam, and TerminalBench. Second, fixed-budget evaluations can increasingly understate frontier capability as models advance. Newer models reach higher performance at large budgets, where they unlock harder tasks and solve them more reliably. Third, benchmarks differ in which inference-scaling methods help most: repeated submission broadly improves performance, but the value of larger token budgets, external feedback, and parallel attempts varies by benchmark. Overall, our results show that benchmark scores are protocol-dependent. We therefore argue that evaluations should report capability as a function of inference-time compute, specify protocol choices explicitly, and compare model generations over a large shared compute range at matched budgets, especially in safety- or policy-relevant settings.
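As a minimal sketch of what reporting capability "as a function of inference-time compute" could look like, the snippet below turns per-task records into a solve-rate curve over token budgets. It is an illustration under stated assumptions, not the evaluation harness used in this work: the function name `solve_rate_by_budget`, the record format (cumulative tokens at first correct submission, or `None` if the task was never solved), and all numbers are hypothetical. It also assumes that a run at a larger budget is a continuation of the run at a smaller one, so a task solved at budget B stays solved at any budget above B.

```python
from typing import Optional, Sequence


def solve_rate_by_budget(
    tokens_to_solve: Sequence[Optional[int]],
    budgets: Sequence[int],
) -> list[float]:
    """Return the fraction of tasks solved within each token budget.

    tokens_to_solve[i] is the cumulative number of tokens the model had
    used when task i was first solved, or None if it was never solved.
    (Hypothetical record format, chosen for illustration only.)
    """
    if not tokens_to_solve:
        raise ValueError("need at least one task")
    n = len(tokens_to_solve)
    return [
        sum(1 for t in tokens_to_solve if t is not None and t <= b) / n
        for b in budgets
    ]


# Made-up records for two hypothetical model generations on four tasks.
budgets = [50_000, 200_000, 1_000_000]
older_tokens = [20_000, 40_000, None, None]
newer_tokens = [20_000, 40_000, 600_000, 900_000]

older_curve = solve_rate_by_budget(older_tokens, budgets)
newer_curve = solve_rate_by_budget(newer_tokens, budgets)

for b, old, new in zip(budgets, older_curve, newer_curve):
    print(f"budget={b:>9,} tokens  older={old:.2f}  newer={new:.2f}")
```

In this toy data the two models look identical at the two smaller budgets and only separate at the largest one, which is the kind of gap a single restrictive budget would hide. Reporting the whole curve at matched budgets, together with the protocol choices that produced it, makes such comparisons explicit.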