Recent math benchmarks for large language models (LLMs) such as MathArena indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, o3-mini, achieving scores comparable to top human competitors. However, these benchmarks evaluate models solely on final numerical answers, neglecting the rigorous reasoning and proof generation essential for real-world mathematical tasks. To address this, we introduce the first comprehensive evaluation of full-solution reasoning on challenging mathematical problems. Using expert human annotators, we evaluated several state-of-the-art reasoning models on the six problems from the 2025 USAMO within hours of their release. Our results reveal that all tested models struggled significantly, scoring less than 5% of the maximum score on average. Through detailed analysis of reasoning traces, we identify the most common failure modes and find several unwanted artifacts arising from the optimization strategies employed during model training. Overall, our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks, highlighting the need for substantial improvements in reasoning and proof generation capabilities.