Reasoning ability has become a central focus in the advancement of Large Reasoning Models (LRMs). Although notable progress has been achieved on reasoning benchmarks such as MATH500 and LiveCodeBench, existing benchmarks for algorithmic reasoning remain limited and leave a critical question unanswered: Do LRMs truly master algorithmic reasoning? To answer this question, we propose AlgBench, an expert-curated benchmark that evaluates LRMs under an algorithm-centric paradigm. AlgBench consists of over 3,000 original problems spanning 27 algorithms, constructed by ACM algorithmic experts and organized under a comprehensive taxonomy with six categories: Euclidean-structured, non-Euclidean-structured, non-optimized, local-optimized, global-optimized, and heuristic-optimized. Empirical evaluations on leading LRMs (e.g., Gemini-3-Pro, DeepSeek-v3.2-Speciale, and GPT-o3) reveal substantial performance heterogeneity: while models perform well on non-optimized tasks (up to 92%), accuracy drops sharply to around 49% on global-optimized algorithms such as dynamic programming. Further analysis uncovers \textbf{strategic over-shifts}, wherein models prematurely abandon correct algorithmic designs due to necessary low-entropy tokens. These findings expose fundamental limitations of problem-centric reinforcement learning and highlight the necessity of an algorithm-centric training paradigm for robust algorithmic reasoning.