In recent years, the rapid development of large reasoning models has saturated existing benchmarks for evaluating mathematical reasoning, highlighting the urgent need for more challenging and rigorous evaluation frameworks. To address this gap, we introduce OlymMATH, a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions. The problems are systematically organized into two distinct difficulty tiers: (1) AIME-level problems (easy), which establish a baseline for mathematical reasoning assessment, and (2) significantly more challenging problems (hard), designed to push the boundaries of current state-of-the-art models. The problems span four core mathematical fields, and each includes a verifiable numerical solution that enables objective, rule-based evaluation. Empirical results underscore the significant challenge posed by OlymMATH: state-of-the-art models, including DeepSeek-R1 and OpenAI's o3-mini, achieve notably limited accuracy on the hard subset. Furthermore, the benchmark facilitates comprehensive bilingual assessment of mathematical reasoning abilities, a critical dimension that remains largely unaddressed in mainstream mathematical reasoning benchmarks. We release the OlymMATH benchmark at the STILL project: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs.
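The abstract notes that every problem carries a verifiable numerical answer so that grading can be objective and rule-based. A minimal sketch of what such a checker might look like is given below; the function name `is_correct`, the relative-tolerance value, and the simple fraction handling are illustrative assumptions, not the benchmark's actual grading code.

```python
import math

def is_correct(predicted: str, reference: str, rel_tol: float = 1e-6) -> bool:
    """Rule-based answer check (illustrative sketch): parse both strings
    as numbers and compare within a relative tolerance, so surface
    differences like "0.50" vs "1/2" do not cause false negatives."""
    def parse(s: str) -> float:
        s = s.strip().replace(",", "")
        if "/" in s:  # allow simple fractions such as "1/2"
            num, den = s.split("/", 1)
            return float(num) / float(den)
        return float(s)
    try:
        return math.isclose(parse(predicted), parse(reference), rel_tol=rel_tol)
    except (ValueError, ZeroDivisionError):
        return False  # unparseable or degenerate answers count as wrong

print(is_correct("1/2", "0.50"))  # → True
print(is_correct("3.14", "2"))    # → False
```

A real grader would also need to normalize symbolic forms (e.g. `\frac{1}{2}` in LaTeX), but numeric comparison with a tolerance is the core of rule-based evaluation on numerical answers.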