Benchmarks are central to measuring the capabilities of large language models and guiding model development, yet widespread data leakage from pretraining corpora undermines their validity. Models can match memorized content rather than demonstrate true generalization, which inflates scores, distorts cross-model comparisons, and misrepresents progress. We introduce ArenaBencher, a model-agnostic framework for automatic benchmark evolution that updates test cases while preserving comparability. Given an existing benchmark and a diverse pool of models to be evaluated, ArenaBencher infers the core ability of each test case, generates candidate question-answer pairs that preserve the original objective, verifies correctness and intent with an LLM as a judge, and aggregates feedback from multiple models to select candidates that expose shared weaknesses. The process runs iteratively with in-context demonstrations that steer generation toward more challenging and diagnostic cases. We apply ArenaBencher to math problem solving, commonsense reasoning, and safety domains and show that it produces verified, diverse, and fair updates that uncover new failure modes, increase difficulty while preserving test objective alignment, and improve model separability. The framework provides a scalable path to continuously evolve benchmarks in step with the rapid progress of foundation models.
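To make the described pipeline concrete, below is a minimal sketch of one evolution trajectory for a single test case, assuming hypothetical callables for ability inference, candidate generation, LLM-as-judge verification, and the evaluated model pool. All names, data fields, and the selection heuristic are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the iterative benchmark-evolution loop described above.
# Function names, data fields, and the selection heuristic are assumptions for
# illustration only; they are not ArenaBencher's actual implementation.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TestCase:
    question: str
    answer: str
    core_ability: str = ""  # inferred skill the case is meant to probe

def evolve_case(
    case: TestCase,
    infer_ability: Callable[[TestCase], str],                        # LLM call: what ability does this case test?
    generate_candidates: Callable[[TestCase, List[TestCase]], List[TestCase]],  # new QA pairs preserving the objective
    judge: Callable[[TestCase, TestCase], bool],                     # LLM-as-judge: correct and intent-preserving?
    model_pool: List[Callable[[str], str]],                          # diverse models under evaluation
    demos: List[TestCase],                                           # in-context demonstrations from earlier rounds
    rounds: int = 3,
) -> TestCase:
    """Evolve one test case over several rounds (sketch, not the paper's exact algorithm)."""
    case.core_ability = infer_ability(case)
    best = case
    for _ in range(rounds):
        # Keep only candidates the judge verifies as correct and aligned with the original intent.
        candidates = [c for c in generate_candidates(best, demos) if judge(case, c)]
        if not candidates:
            break

        # Aggregate feedback: prefer the verified candidate that the most models answer
        # incorrectly, i.e. the one exposing shared weaknesses across the pool.
        def failure_count(c: TestCase) -> int:
            return sum(model(c.question).strip() != c.answer for model in model_pool)

        best = max(candidates, key=failure_count)
        demos.append(best)  # steer the next round toward harder, more diagnostic cases
    return best
```

In this sketch, appending each selected candidate to `demos` is what makes the loop iterative: later generation calls condition on earlier hard cases, which is one plausible reading of the in-context steering the abstract describes.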