As frontier language models approach ceiling performance on static mathematical benchmarks, existing evaluations increasingly fail to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a self-play benchmark in which models occupy dual roles: each authors math problems under adversarial prompting and solves the problems authored by every other participant. Problems are produced through a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification) and validated by an independent verifier that excludes ill-posed questions. A Rasch model (Rasch, 1993) jointly estimates solver abilities and problem difficulties; author quality is derived from the difficulties of each model's authored problems. Experiments across 19 frontier models show that authoring and solving capabilities are partially decoupled, and that dual-role evaluation exposes capability separations invisible in single-role benchmarks. As newer models enter the arena, they produce problems that defeat previously dominant solvers, so the benchmark's difficulty co-evolves with participant strength rather than saturating at a fixed ceiling. We host a public leaderboard that updates as new models are released.
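To make the scoring step concrete, the following is a minimal sketch of a Rasch fit, not the MathDuels implementation. It assumes the standard one-parameter form, P(solver i solves problem j) = sigmoid(theta_i - b_j), fits it by plain gradient ascent with a small ridge penalty, and takes author quality to be the mean difficulty of a model's authored problems. The abstract specifies neither the estimator nor the aggregation, so both are assumptions, and the names `fit_rasch` and `author_quality` as well as the toy outcomes are hypothetical.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fit_rasch(outcomes, n_solvers, n_problems, steps=2000, lr=0.05, l2=0.1):
    """Jointly estimate solver abilities (theta) and problem difficulties (b).

    outcomes: list of (solver, problem, correct) with correct in {0, 1}.
    The ridge penalty l2 is an assumption of this sketch; it keeps estimates
    finite for problems that every solver passes or every solver fails.
    """
    theta = [0.0] * n_solvers
    b = [0.0] * n_problems
    for _ in range(steps):
        # Gradient of the penalized log-likelihood.
        g_theta = [-l2 * t for t in theta]
        g_b = [-l2 * d for d in b]
        for s, p, y in outcomes:
            resid = y - sigmoid(theta[s] - b[p])
            g_theta[s] += resid
            g_b[p] -= resid
        theta = [t + lr * g for t, g in zip(theta, g_theta)]
        b = [d + lr * g for d, g in zip(b, g_b)]
        # The likelihood only depends on theta - b, so fix the location by
        # shifting both vectors until the mean difficulty is zero.
        shift = sum(b) / n_problems
        b = [d - shift for d in b]
        theta = [t - shift for t in theta]
    return theta, b


def author_quality(b, author_of):
    """Mean difficulty of each author's problems (assumed aggregation)."""
    totals, counts = {}, {}
    for problem, author in enumerate(author_of):
        totals[author] = totals.get(author, 0.0) + b[problem]
        counts[author] = counts.get(author, 0) + 1
    return {a: totals[a] / counts[a] for a in totals}


# Toy arena: three models, each authoring two problems and solving only the
# problems written by the other two. All outcomes are made up.
author_of = [0, 0, 1, 1, 2, 2]
outcomes = [
    (0, 2, 1), (0, 3, 1), (0, 4, 1), (0, 5, 0),
    (1, 0, 0), (1, 1, 1), (1, 4, 1), (1, 5, 0),
    (2, 0, 0), (2, 1, 0), (2, 2, 1), (2, 3, 1),
]

theta, b = fit_rasch(outcomes, n_solvers=3, n_problems=6)
quality = author_quality(b, author_of)

print("abilities:     ", [round(t, 2) for t in theta])
print("difficulties:  ", [round(d, 2) for d in b])
print("author quality:", {a: round(q, 2) for a, q in quality.items()})
```

In this toy data, model 0's problems are failed more often than model 1's, so model 0 ends up with the higher author score regardless of how well it solves. That separation between the two roles is what the dual-role evaluation is meant to measure.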