Recent math benchmarks for large language models (LLMs) such as MathArena indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, o3-mini, achieving scores comparable to top human competitors. However, these benchmarks evaluate models solely on final numerical answers, neglecting the rigorous reasoning and proof generation essential for real-world mathematical tasks. To address this, we introduce the first comprehensive evaluation of full-solution reasoning on challenging mathematical problems. Using expert human annotators, we evaluated several state-of-the-art reasoning models on the six problems from the 2025 USAMO within hours of their release. Our results reveal that all tested models struggled significantly, scoring less than 5% of the maximum score on average. Through detailed analysis of reasoning traces, we identify the most common failure modes and find several unwanted artifacts arising from the optimization strategies employed during model training. Overall, our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks, highlighting the need for substantial improvements in reasoning and proof generation capabilities.