From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics

Large language models now solve many benchmark math problems at near-expert levels, yet this progress has not fully translated into reliable performance in real-world applications. We study this gap through contextual mathematical reasoning, where the mathematical core must be formulated from descriptive scenarios. We introduce ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings: Scenario Grounding (SG), which embeds abstract problems into realistic narratives without increasing reasoning complexity, and Complexity Scaling (CS), which transforms explicit conditions into sub-problems to capture how constraints often appear in practice. Evaluating 61 proprietary and open-source models, we observe sharp drops: on average, open-source models decline by 13 and 34 points on SG and CS, while proprietary models drop by 13 and 20. Error analysis shows that errors are dominated by incorrect problem formulation, with formulation accuracy declining as original problem difficulty increases. Correct formulation emerges as a prerequisite for success, and its sufficiency improves with model scale, indicating that larger models advance in both understanding and reasoning. Nevertheless, formulation and reasoning remain two complementary bottlenecks that limit contextual mathematical problem solving. Finally, we find that fine-tuning with scenario data improves performance, whereas formulation-only training is ineffective. However, performance gaps are only partially alleviated, highlighting contextual mathematical reasoning as a central unsolved challenge for LLMs.
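To make the quantities in the abstract concrete, the following is a minimal sketch of how a point drop, and the necessity and sufficiency of correct formulation, could be computed from per-problem records. The record fields (`orig_correct`, `ctx_correct`, `formulation_correct`), the function names, and the toy data are all hypothetical illustrations. They are not the paper's evaluation code or its results.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # All three fields are assumed labels for illustration only.
    orig_correct: bool         # solved the original abstract problem
    ctx_correct: bool          # solved the contextual (SG or CS) variant
    formulation_correct: bool  # recovered the right mathematical core


def accuracy(flags):
    """Percentage of True values; 0.0 for an empty input."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def point_drop(records):
    """Accuracy on original problems minus accuracy on contextual variants."""
    original = accuracy(r.orig_correct for r in records)
    contextual = accuracy(r.ctx_correct for r in records)
    return original - contextual


def necessity(records):
    """Share of correct contextual answers that had a correct formulation."""
    solved = [r for r in records if r.ctx_correct]
    return accuracy(r.formulation_correct for r in solved)


def sufficiency(records):
    """Share of correct formulations that led to a correct final answer."""
    formulated = [r for r in records if r.formulation_correct]
    return accuracy(r.ctx_correct for r in formulated)


# Toy records, invented purely to exercise the functions.
toy = [
    Record(orig_correct=True, ctx_correct=True, formulation_correct=True),
    Record(orig_correct=True, ctx_correct=False, formulation_correct=True),
    Record(orig_correct=True, ctx_correct=False, formulation_correct=False),
    Record(orig_correct=False, ctx_correct=False, formulation_correct=False),
]

print(point_drop(toy), necessity(toy), sufficiency(toy))
```

Under this reading, a necessity near 100 with a lower sufficiency would match the abstract's claim that correct formulation is a prerequisite for success but does not guarantee it, leaving reasoning as the second bottleneck.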
