Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung) mathematics: 664 problems with a 339-item core set carrying official per-item error rates from nationwide cohorts of hundreds of thousands of examinees. We pair the benchmark with Difficulty-aligned Reasoning Gain (DRG): a score-orthogonal metric that asks whether a model's mistakes concentrate on the items humans found hard, or on items humans found easy. Together they expose, across a wide range of VLMs (and LLMs via OCR), three patterns: (i) low-budget accuracy collapses on the high-human-error tail at every model size; (ii) test-time scaling (TTS) raises token use roughly linearly with cohort error rate, while accuracy gains follow a non-monotonic curve; (iii) within a single family, TTS flips between anti-scaling on the hardest items and overthinking on easier ones -- two faces of the same alignment failure. On DRG, models with near-identical accuracy can sit at near-opposite values: one model gets wrong what humans also find hard, while another solves the hardest items yet fails on items humans find easy -- a contrast that aggregate accuracy hides. Our code and dataset builder will be open-sourced at https://github.com/naver-ai/KCSAT-ML.
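To make the idea of a score-orthogonal alignment measure concrete, the following is a minimal sketch and not the DRG definition itself, which this abstract does not give. It assumes a hypothetical function `alignment_score` that takes per-item human error rates and per-item model correctness, then compares every missed item against every solved item by human error rate (a rank-based, AUC-style statistic rescaled to [-1, 1]). All names and the toy data are illustrative assumptions.

```python
from typing import Sequence


def alignment_score(human_error: Sequence[float],
                    model_correct: Sequence[bool]) -> float:
    """Illustrative stand-in, NOT the paper's DRG formula.

    Returns a value in [-1, 1]:
      +1 when every item the model missed was harder for humans
         than every item it solved,
      -1 for the reverse,
       0 when misses and solves are indistinguishable by human error rate.
    """
    if len(human_error) != len(model_correct):
        raise ValueError("inputs must have the same length")

    # Split human error rates by whether the model solved the item.
    missed = [h for h, c in zip(human_error, model_correct) if not c]
    solved = [h for h, c in zip(human_error, model_correct) if c]
    if not missed or not solved:
        raise ValueError("need at least one solved and one missed item")

    # Count (missed, solved) pairs where the missed item was harder for
    # humans; ties count as half a pair.
    wins = 0.0
    for m in missed:
        for s in solved:
            if m > s:
                wins += 1.0
            elif m == s:
                wins += 0.5

    auc = wins / (len(missed) * len(solved))
    return 2.0 * auc - 1.0


# Toy data (hypothetical): four items with increasing human error rate.
human_error = [0.1, 0.2, 0.8, 0.9]
model_a = [True, True, False, False]   # misses only the hard items
model_b = [False, False, True, True]   # misses only the easy items

print(alignment_score(human_error, model_a))  # 1.0
print(alignment_score(human_error, model_b))  # -1.0
```

Both toy models answer half the items correctly, yet they land at opposite ends of the scale, which is the kind of contrast that aggregate accuracy cannot show. Because the statistic depends only on how missed and solved items are ordered by human error rate, its range stays [-1, 1] at any accuracy level.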