We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for mathematical research. Instead, we establish an updatable benchmark evaluating models directly on the latest research results in mathematics. This consists of an automatic pipeline that extracts lemmas from arXiv and rewrites them into self-contained statements by making all assumptions and required definitions explicit. It results in a benchmark that can be updated regularly with new problems taken directly from human mathematical research, while previous instances can be used for training without compromising future evaluations. We benchmark current state-of-the-art LLMs, which obtain around 10-15$\%$ accuracy in theorem proving (pass@1) depending on the model, showing that there is currently a large margin of progression for LLMs to reach human-level proving capabilities in a research context.
翻译:我们提出了一种评估大语言模型在数学研究领域能力的新基准方法。现有基准主要依赖静态、人工筛选的竞赛或教科书风格问题作为数学研究的替代评估。与之不同,我们建立了一个可直接基于数学领域最新研究成果进行模型评估的可更新基准。该基准通过自动化流程实现:从arXiv提取引理,并通过显式说明所有假设和必要定义将其重写为自包含的命题陈述。由此产生的基准能够定期更新,直接采用人类数学研究中的新问题,同时既往题目可用于训练而不会影响未来评估。我们对当前最先进的大语言模型进行了基准测试,结果显示在定理证明任务中(pass@1)准确率约为10-15%(具体取决于模型),这表明LLM在研究语境下要达到人类水平的证明能力仍存在巨大提升空间。