Recent advances in reasoning Large Language Models (LLMs) have relied primarily on upfront thinking, where reasoning occurs before the final answer is produced. This approach has critical limitations in code generation: upfront thinking is often insufficient because a problem's full complexity emerges only during implementation, and it cannot adaptively allocate reasoning effort across a generation process whose difficulty varies significantly. In this paper, we propose Think-Anywhere, a novel reasoning mechanism that enables LLMs to invoke thinking on demand at any token position during code generation. We realize Think-Anywhere by first teaching LLMs to imitate the reasoning patterns through cold-start training, then using outcome-based reinforcement learning (RL) rewards to drive the model's autonomous exploration of when and where to invoke reasoning. Extensive experiments on four mainstream code generation benchmarks (LeetCode, LiveCodeBench, HumanEval, and MBPP) show that Think-Anywhere achieves state-of-the-art performance compared with both existing reasoning methods and recent post-training approaches, and generalizes consistently across diverse LLMs. Our analysis further reveals that Think-Anywhere enables the model to adaptively invoke reasoning at high-entropy positions, providing enhanced interpretability.
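To make the notion of a "high-entropy position" concrete, the following is a minimal illustrative sketch, not the paper's analysis code. It assumes access to per-position next-token probability distributions; the function names, the toy distributions, and the threshold value are all hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_positions(distributions, threshold):
    """Return the indices of positions whose entropy exceeds the threshold.

    `threshold` is an illustrative assumption, not a value from the paper.
    """
    return [
        i for i, probs in enumerate(distributions)
        if token_entropy(probs) > threshold
    ]

# Toy next-token distributions over a 4-token vocabulary (made-up values).
toy = [
    [0.97, 0.01, 0.01, 0.01],  # model is confident about the next token
    [0.25, 0.25, 0.25, 0.25],  # model is maximally uncertain
    [0.70, 0.10, 0.10, 0.10],  # moderately confident
]

flagged = high_entropy_positions(toy, threshold=1.0)
```

Under these assumptions, only the uniform distribution exceeds the threshold, so a position like it is where on-demand reasoning would be most plausibly invoked.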