Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

Guijin Son, Seungone Kim, Catherine Arnett, Hyunwoo Ko, Hyein Lee, Hyeonah Kang, Jiang Longxi, Jin Yun, JungYup Lee, Kyungmin Lee, Sam Yoosuk Kim, Sang Park, Seunghyeok Hong, SeungJae Lee, Seungyeop Yi, Shinae Shin, SunHye Bok, Sunyoung Shin, Yonghoon Ji, Youngtaek Kim, Hanearl Jung, Akari Asai, Graham Neubig, Sean Welleck, Youngjae Yu, Akshelin R, Alexander B. Ivanov, Boboev Muhammadjon, Chae Young Han, Christian Stump, Cooper R. Anderson, Dmitrii Karp, Dohyun Kwon, Dongryung Yi, DoYong Kwon, Duk-Soon Oh, Eunho Choi, Giovanni Resta, Greta Panova, Huiyun Noh, Hyungryul Baik, Hyungsun Bae, Inomov Mashrafdzhon, Jeewon Kim, Jeong-Rae Kim, Ji Eun Lee, Jiaqi Liu, Jieui Kang, Jimin Kim, Jon-Lark Kim, Joonyeong Won, Junseo Yoon, Junwoo Jo, Kibeom Kim, Kiwoon Kwon, Mario Kummer, Max Mercer, Min Hoon Kim, Minjun Kim, Nahyun Lee, Ng Ze-An, Nicolas Libedinsky, Rafał Marcin Łochowski, Raphaël Lachièze-Rey, Robert Auffarth, Ruichen Zhang, Sejin Park, Seonguk Seo, Shin Jaehoon, Sunatullo, Taewoong Eom, Yeachan Park, Yongseok Jang, Youchan Oh, Zhaoyang Wang, Zoltán Kovács

From arXiv. Under review. For questions or model-evaluation requests, contact [email protected]

Following the recent achievement of gold-medal performance on the IMO by frontier LLMs, the community is searching for the next meaningful and challenging target for measuring LLM reasoning. Whereas olympiad-style problems measure step-by-step reasoning alone, research-level problems use such reasoning to advance the frontier of mathematical knowledge itself, emerging as a compelling alternative. Yet research-level math benchmarks remain scarce because such problems are difficult to source (e.g., Riemann Bench and FrontierMath-Tier 4 contain 25 and 50 problems, respectively). To support reliable evaluation of next-generation frontier models, we introduce Soohak, a 439-problem benchmark newly authored from scratch by 64 mathematicians. Soohak comprises two subsets. On the Challenge subset, frontier models including Gemini-3-Pro, GPT-5, and Claude-Opus-4.5 reach 30.4%, 26.4%, and 10.4% respectively, leaving substantial headroom, while leading open-weight models such as Qwen3-235B, GPT-OSS-120B, and Kimi-2.5 remain below 15%. Notably, beyond standard problem solving, Soohak introduces a refusal subset that probes a capability intrinsic to research mathematics: recognizing ill-posed problems and pausing rather than producing confident but unjustified answers. On this subset, no model exceeds 50%, identifying refusal as a new optimization target that current models do not directly address. To prevent contamination, the dataset will be publicly released in late 2026, with model evaluations available upon request in the interim.

