Current evaluations of mathematical reasoning in large language models (LLMs) are dominated by static benchmarks, either derived from competition-style problems or curated through costly expert effort, resulting in limited coverage of research-level mathematics and rapid performance saturation. We propose a fully automated, theorem-grounded pipeline for evaluating frontier mathematical reasoning, which directly transforms recent peer-reviewed mathematical literature into executable and verifiable reasoning tasks. The pipeline identifies constructive or quantitative results, instantiates them into parameterized problem templates, and generates deterministic solutions through execution-based verification, enabling scalable, reproducible, and continuously updatable evaluation without reliance on large-scale expert authoring. By design, this approach supports temporal extensibility, intrinsic correctness checking, and domain-specific customization across mathematical subfields. Applying this pipeline yields \textbf{EternalMath}, an evolving evaluation suite derived from contemporary research papers. Experiments with state-of-the-art LLMs reveal substantial performance gaps, indicating that mathematical reasoning at the research frontier remains far from saturated and underscoring the need for evaluation methodologies that evolve in step with human mathematical discovery.
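As a minimal, hypothetical sketch of the template-and-verify step described above: a quantitative result is distilled into a parameterized template whose instances are paired with a deterministic, executable reference solver, so sampled answers can be checked by execution rather than expert grading. The theorem, parameter ranges, and solver names below are illustrative stand-ins, not components of the EternalMath pipeline itself.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ProblemTemplate:
    """A parameterized task distilled from a constructive or quantitative result."""
    statement: str                               # natural-language template with placeholders
    sample_params: Callable[[], Dict[str, int]]  # draws a concrete parameter instance
    reference_solver: Callable[..., int]         # deterministic, executable ground truth

    def instantiate(self, seed: int):
        random.seed(seed)
        params = self.sample_params()
        question = self.statement.format(**params)
        answer = self.reference_solver(**params)  # execution-based verification target
        return question, answer


# Toy stand-in theorem: the count of integers in [1, n] coprime to n (Euler's totient).
def totient(n: int) -> int:
    return sum(1 for k in range(1, n + 1) if math.gcd(k, n) == 1)


template = ProblemTemplate(
    statement="How many integers in [1, {n}] are coprime to {n}?",
    sample_params=lambda: {"n": random.randint(10, 10_000)},
    reference_solver=totient,
)

question, answer = template.instantiate(seed=42)
print(question, "->", answer)  # a model's output would later be compared against `answer`
```

Because the reference solver is executed per instance, correctness checking is intrinsic to each generated task and new instances can be drawn indefinitely from fresh literature.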