Recent advances in reasoning Large Language Models (LLMs) have relied primarily on upfront thinking, where all reasoning occurs before the final answer is produced. This approach has critical limitations in code generation: upfront thinking is often insufficient because a problem's full complexity only reveals itself during implementation, and it cannot adaptively allocate reasoning effort across a generation process whose difficulty varies significantly from one position to the next. In this paper, we propose Think-Anywhere, a novel reasoning mechanism that enables LLMs to invoke thinking on demand at any token position during code generation. We realize Think-Anywhere in two stages: cold-start training first teaches LLMs to imitate this reasoning pattern, and outcome-based RL rewards then drive the model's autonomous exploration of when and where to invoke reasoning. Extensive experiments on four mainstream code generation benchmarks (i.e., LeetCode, LiveCodeBench, HumanEval, and MBPP) show that Think-Anywhere achieves state-of-the-art performance compared with both existing reasoning methods and recent post-training approaches, while generalizing consistently across diverse LLMs. Our analysis further reveals that Think-Anywhere leads the model to invoke reasoning adaptively at high-entropy positions, which also makes its behavior more interpretable.