Training large language models (LLMs) to reason via reinforcement learning (RL) significantly improves their problem-solving capabilities. In agentic settings, existing methods like ReAct prompt LLMs to explicitly plan before every action; however, we demonstrate that always planning is computationally expensive and degrades performance on long-horizon tasks, while never planning also limits performance. To address this, we introduce a conceptual framework formalizing dynamic planning for LLM agents, enabling them to flexibly decide when to allocate test-time compute to planning. We propose a simple two-stage training pipeline: (1) supervised fine-tuning on diverse synthetic data to prime models for dynamic planning, and (2) RL to refine this capability in long-horizon environments. Experiments on the Crafter environment show that dynamic planning agents trained with this approach are more sample-efficient and consistently achieve more complex objectives. Additionally, we demonstrate that these agents can be effectively steered by human-written plans, surpassing their independent capabilities and highlighting the potential for safer and more collaborative agentic systems.
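The control flow described above can be sketched as a minimal agent loop. This is a hypothetical illustration, not the paper's implementation: the `DynamicPlanningAgent` class, its `should_plan` heuristic, and the string stand-ins for LLM calls are all assumptions introduced for clarity. In the actual method, the decision of when to plan is learned via SFT on synthetic data and then refined with RL, rather than hard-coded.

```python
# Hypothetical sketch of a dynamic-planning agent loop (not the paper's code).
# At each step the agent decides whether to spend test-time compute on an
# explicit plan before acting, instead of always planning (ReAct-style)
# or never planning.

from dataclasses import dataclass, field


@dataclass
class DynamicPlanningAgent:
    plan: str = ""                                  # current (possibly stale) plan
    history: list = field(default_factory=list)     # (observation, plan, action) log

    def should_plan(self, observation: str) -> bool:
        # Placeholder decision rule; in the paper this choice is learned,
        # not hand-written. Here: replan only when no plan exists yet or
        # the observation signals a novel situation.
        return self.plan == "" or "new area" in observation

    def step(self, observation: str) -> str:
        if self.should_plan(observation):
            self.plan = f"plan for: {observation}"  # stand-in for an LLM planning call
        action = f"act({observation})"              # stand-in for an LLM action call
        self.history.append((observation, self.plan, action))
        return action


agent = DynamicPlanningAgent()
agent.step("start")       # no plan yet, so the agent plans first
agent.step("same area")   # existing plan is reused; compute goes to acting
```

The point of the sketch is the branch in `step`: planning becomes a conditional expenditure of test-time compute rather than a fixed prefix of every action.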