Recent AI systems have achieved gold-medal-level performance on the International Mathematical Olympiad, demonstrating remarkable proficiency at competition-style problem solving. However, competition mathematics represents only a narrow slice of mathematical reasoning: problems are drawn from limited domains, require minimal advanced machinery, and can often reward insightful tricks over deep theoretical knowledge. We introduce Riemann-Bench, a private benchmark of expert-curated problems designed to evaluate AI systems on research-level mathematics that goes far beyond the olympiad frontier. Problems are authored by Ivy League mathematics professors, graduate students, and PhD-holding IMO medalists, and routinely took their authors weeks to solve independently. Each problem undergoes double-blind verification by two independent domain experts who must solve the problem from scratch, and yields a unique, closed-form solution assessed by programmatic verifiers. We evaluate frontier models as unconstrained research agents, with full access to coding tools, search, and open-ended reasoning, using an unbiased statistical estimator computed over 100 independent runs per problem. Our results reveal that all frontier models currently score below 10%, exposing a substantial gap between olympiad-level problem solving and genuine research-level mathematical reasoning. By keeping the benchmark fully private, we ensure that measured performance reflects authentic mathematical capability rather than memorization of training data.
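The abstract does not say which unbiased estimator is used over the 100 runs per problem. As an illustration only, the sketch below assumes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of runs, c the number of correct runs, and k the sample budget. The function names `pass_at_k` and `benchmark_score`, and the example counts, are hypothetical and not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k
    samples is correct, given c correct answers among n independent runs.

    Assumption: this is the standard pass@k estimator; the paper's
    actual estimator is not specified in the abstract.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect runs: every size-k subset contains a correct run.
    if n - c < k:
        return 1.0
    # 1 - (number of all-incorrect subsets) / (number of all subsets)
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimates to get a benchmark-level score."""
    if not correct_counts:
        raise ValueError("correct_counts must be non-empty")
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Hypothetical per-problem correct counts out of 100 runs each.
    counts = [0, 0, 7, 0, 12]
    print(benchmark_score(counts, n=100, k=1))
```

With k = 1 the estimator reduces to the per-problem success rate c / n, so the benchmark score is the mean fraction of correct runs. Larger k measures whether a model can solve a problem at least once within k attempts.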