Following the recent achievement of gold-medal performance on the IMO by frontier LLMs, the community is searching for the next meaningful and challenging target for measuring LLM reasoning. Research-level problems are emerging as a compelling alternative: whereas olympiad-style problems measure step-by-step reasoning alone, research-level problems use such reasoning to advance the frontier of mathematical knowledge itself. Yet research-level math benchmarks remain scarce because such problems are difficult to source (e.g., Riemann Bench and FrontierMath-Tier 4 contain 25 and 50 problems, respectively). To support reliable evaluation of next-generation frontier models, we introduce Soohak, a 439-problem benchmark authored from scratch by 64 mathematicians. Soohak comprises two subsets. On the Challenge subset, frontier models including Gemini-3-Pro, GPT-5, and Claude-Opus-4.5 reach 30.4%, 26.4%, and 10.4%, respectively, leaving substantial headroom, while leading open-weight models such as Qwen3-235B, GPT-OSS-120B, and Kimi-2.5 remain below 15%. Beyond standard problem solving, Soohak introduces a Refusal subset that probes a capability intrinsic to research mathematics: recognizing ill-posed problems and pausing rather than producing confident but unjustified answers. On this subset, no model exceeds 50%, which identifies refusal as a new optimization target that current models do not directly address. To prevent contamination, the dataset will be publicly released in late 2026, with model evaluations available upon request in the interim.