Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks, raising concerns about whether these benchmarks can still diagnose genuine reasoning competence. This saturation largely stems from the dominance of template-based computation and shallow arithmetic decomposition in existing datasets, which underrepresent reasoning skills such as multi-constraint coordination, constructive logical synthesis, and spatial inference. To address this gap, we introduce ReasoningMath-Plus, a benchmark of 150 carefully curated problems explicitly designed to evaluate structural reasoning. Each problem emphasizes reasoning under interacting constraints, constructive solution formation, or non-trivial structural insight, and is annotated with a minimal reasoning skeleton to support fine-grained, process-level evaluation. Alongside the dataset, we introduce HCRS (Hazard-aware Chain-based Rule Score), a deterministic step-level scoring function, and train a Process Reward Model (PRM) on the annotated reasoning traces. Empirically, while leading models attain relatively high final-answer accuracy (up to 5.8/10), HCRS-based holistic evaluation yields substantially lower scores (average 4.36/10, best 5.14/10), showing that answer-only metrics can overestimate reasoning robustness.
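To make the idea of a deterministic, step-level scoring function concrete, the sketch below shows one way such a scorer could be structured: each step of a reasoning chain is checked against rule predicates that assign hazard penalties, and per-step credits are aggregated into a 0-10 chain score. This is a minimal illustration only; the `Step` fields, the two rules, the penalty weights, and the mean-credit aggregation are all assumptions for exposition and do not reproduce the paper's actual HCRS definition.

```python
# Illustrative sketch of a deterministic step-level scorer in the spirit of
# HCRS. All rule names, weights, and the aggregation are hypothetical.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One step of a model's reasoning chain (fields are illustrative)."""
    text: str
    cites_prior: bool      # does the step reference an earlier result?
    introduces_fact: bool  # does it assert a new, unverified fact?


# A "rule" maps (step, position, full chain) to a penalty in [0, 1].
Rule = Callable[[Step, int, List[Step]], float]


def unsupported_fact_rule(step: Step, i: int, chain: List[Step]) -> float:
    # Hazard: a new fact asserted without grounding in a prior step.
    return 0.5 if step.introduces_fact and not step.cites_prior else 0.0


def dangling_step_rule(step: Step, i: int, chain: List[Step]) -> float:
    # Hazard: a non-initial step that connects to nothing before it.
    return 0.3 if i > 0 and not step.cites_prior else 0.0


def chain_score(chain: List[Step], rules: List[Rule]) -> float:
    """Deterministic 0-10 chain score (illustrative aggregation).

    Each step starts at full credit and loses the summed rule penalties;
    the chain score is the mean per-step credit, rescaled to 0-10.
    """
    if not chain:
        return 0.0
    credits = []
    for i, step in enumerate(chain):
        penalty = sum(rule(step, i, chain) for rule in rules)
        credits.append(max(0.0, 1.0 - penalty))
    return 10.0 * sum(credits) / len(credits)


if __name__ == "__main__":
    chain = [
        Step("Let n = 3k + 1 by the divisibility constraint.", False, False),
        Step("Substituting n into the second constraint gives k = 4.", True, False),
        Step("Therefore the answer is 13.", True, False),
    ]
    rules = [unsupported_fact_rule, dangling_step_rule]
    print(f"chain score: {chain_score(chain, rules):.2f}")  # -> 10.00
```

Because every rule is a pure function of the chain, the score is fully reproducible, which is the property that distinguishes this style of evaluation from judgments produced by a learned reward model such as the PRM.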