Reasoning with a Code Interpreter (CI) has emerged as an effective paradigm for enhancing the reasoning capabilities of large language models (LLMs) through executable computation and iterative verification. Despite its growing adoption, the behavioral properties underlying effective code reasoning remain largely underexplored. In this work, we investigate code reasoning from two distinct perspectives inspired by prior studies of natural language reasoning: extrinsic properties, represented by crucial tokens, and intrinsic properties, represented by code-specific cognitive behaviors. Across multiple LLMs, we find that stronger CI reasoning models consistently exhibit a higher prevalence of crucial tokens and cognitive behaviors, particularly verification, backtracking, and backward chaining. Building on these observations, we examine how these properties can be leveraged during both inference and training. At inference time, appending code-specific crucial tokens improves performance on several reasoning capabilities, including mathematical reasoning, ordering, and optimization, while yielding limited benefits elsewhere. At training time, augmenting a state-of-the-art framework with code-specific cognitive behaviors improves supervised fine-tuning and reinforcement learning performance in two of the three evaluated models. Further analysis shows that these behaviors reduce overthinking in incorrect responses and improve token efficiency, while also revealing factors that limit gains in the remaining model. Our findings provide the first systematic characterization of effective reasoning with CI and demonstrate both the potential and the limitations of leveraging key properties to improve CI-based reasoning.