Large language models now achieve high final-answer accuracy on mathematical reasoning benchmarks, but accuracy alone does not capture reasoning flexibility. We introduce a strategy-level evaluation framework instantiated on 80 AMC 10/12 and AIME problems with 217 AoPS-derived reference strategy families. Model outputs are annotated for strategy identity, validity, and correctness using dual-AI coding with human adjudication. Across four frontier models, we find a pronounced decoupling between answer accuracy and strategy diversity. Under a single-solution prompt, all models achieve high accuracy (95%-100%), but under a multiple-strategy prompt they recover substantially fewer strategies than the human reference set. Gemini, DeepSeek, GPT, and Claude generate 184, 152, 151, and 110 distinct valid strategies, respectively, with the largest gaps in Geometry and Number Theory. The models collectively produce 50 benchmark-novel valid strategies, indicating both incomplete coverage of human strategies and some capacity for alternative reasoning. A repeated-run robustness check on 20 problems shows diminishing gains in discovered strategies, with the strongest model recovering only 39 of 55 AoPS-reference strategies (71%) after three runs. These findings position strategy diversity as a complementary dimension for evaluating mathematical reasoning beyond answer correctness.
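The coverage and novelty figures above can be read as set operations over strategy labels. The following is a minimal sketch under stated assumptions, not the paper's actual pipeline: the function name `strategy_coverage` and the toy strategy labels are hypothetical, and it assumes each annotated valid strategy has already been mapped to a family label.

```python
def strategy_coverage(reference, produced):
    """Compare a model's valid strategies against a reference set.

    Both arguments are iterables of strategy-family labels (hypothetical
    labels here; the real annotation scheme is not reproduced).
    """
    reference, produced = set(reference), set(produced)
    recovered = reference & produced   # reference strategies the model found
    novel = produced - reference       # valid strategies absent from the reference
    coverage = len(recovered) / len(reference) if reference else 0.0
    return {"recovered": len(recovered), "novel": len(novel), "coverage": coverage}


# Toy data for illustration only; these are not results from the study.
reference = {"coordinates", "similar_triangles", "trig", "mass_points"}
produced = {"coordinates", "trig", "complex_numbers"}
stats = strategy_coverage(reference, produced)
print(stats)

# Arithmetic check of the reported robustness figure: 39 of 55 strategies.
print(f"{39 / 55:.0%}")
```

Keeping recovered and novel counts separate mirrors the distinction drawn above between incomplete coverage of human strategies and benchmark-novel valid strategies.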