The rapid progress of frontier large language models has led to widespread benchmark saturation, limiting the ability of existing datasets to differentiate model capabilities or provide useful training signal. For instance, on LiveCodeBench (LCB), frontier models achieve over 99% Pass@1 on the easy split and exceed 90% Pass@1 on average across difficulty levels. Constructing new, challenging datasets typically requires substantial human effort, creating a bottleneck for progress. We introduce BenchEvolver, a solution-centric evolutionary framework that automatically transforms existing coding problems into harder variants. Rather than generating problems from scratch, BenchEvolver evolves reference solutions through structured transformations and derives the corresponding problem statements and tests from the evolved solutions. This design grounds generation in executable semantics, enabling scalable construction of high-quality, diverse, and difficult tasks with verifiable correctness. Applying BenchEvolver to LiveCodeBench and SciCode, we obtain evolved tasks that are substantially harder while maintaining validity, reference correctness, and diversity. We further curate LiveCodeBench-Plus, a 91-problem benchmark combining evolved tasks with difficult original LCB-v6 tasks, on which frontier-model Pass@1 ranges from 27.5% to 62.6%, restoring clear discrimination among strong coding models. Importantly, evolved tasks remain challenging even for the model that generates them, enabling self-improvement. We further show that RL on evolved LCB tasks improves held-out coding performance: for gpt-oss-20b, seed+evolved training achieves Pass@1 gains of +8.7 and +8.3 points on LCB-v6 Hard and LCB-Pro Easy, exceeding the seed-only gains by a relative 70.7% and 34.8%, respectively. Our results show that BenchEvolver can convert saturated benchmarks into frontier-level evaluation suites and reusable training signal.
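As a reading aid for the numbers above, the following is a minimal sketch and not the authors' evaluation code. It assumes Pass@1 is the average, over problems, of the fraction of sampled solutions that pass all tests, and it assumes the 70.7% and 34.8% figures are relative improvements over the seed-only gain. The function names `pass_at_1` and `implied_baseline_gain` are hypothetical and introduced only for illustration.

```python
from typing import Sequence


def pass_at_1(results: Sequence[Sequence[bool]]) -> float:
    """Pass@1 in percent: mean over problems of the per-problem pass fraction.

    results[i][j] is True if the j-th sampled solution for problem i passed
    every test. This averaging convention is an assumption, not taken from
    the abstract.
    """
    if not results or any(len(r) == 0 for r in results):
        raise ValueError("need at least one problem and one sample per problem")
    per_problem = [sum(r) / len(r) for r in results]
    return 100.0 * sum(per_problem) / len(per_problem)


def implied_baseline_gain(combined_gain: float, relative_excess_pct: float) -> float:
    """Baseline gain implied when combined_gain exceeds it by relative_excess_pct percent.

    Solves combined_gain = baseline * (1 + relative_excess_pct / 100).
    """
    return combined_gain / (1.0 + relative_excess_pct / 100.0)


if __name__ == "__main__":
    # Toy data: two problems, two samples each -> (0.5 + 1.0) / 2 = 75.0
    print(pass_at_1([[True, False], [True, True]]))

    # Under the relative reading, the seed-only gains implied by the abstract
    # would be roughly 5.1 and 6.2 points (approximate, subject to rounding).
    print(round(implied_baseline_gain(8.7, 70.7), 1))
    print(round(implied_baseline_gain(8.3, 34.8), 1))
```

The implied seed-only gains are back-of-the-envelope values derived from the rounded figures in the abstract, so they may differ slightly from the numbers the authors report.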