We present the first mechanistic evidence that model-free reinforcement learning agents can learn to plan. We obtain this evidence by applying a concept-based interpretability methodology to a model-free agent in Sokoban, a commonly used benchmark for studying planning. Specifically, we demonstrate that DRC, a generic model-free agent introduced by Guez et al. (2019), uses learned concept representations to internally formulate plans that both predict the long-term effects of actions on the environment and influence action selection. Our methodology involves (1) probing for planning-relevant concepts, (2) investigating how plans form within the agent's representations, and (3) verifying through interventions that the discovered plans have a causal effect on the agent's behavior. We also show that the emergence of these plans coincides with the emergence of a planning-like property: the ability to benefit from additional test-time compute. Finally, we perform a qualitative analysis of the planning algorithm learned by the agent and find that it strongly resembles parallelized bidirectional search. Our findings advance understanding of the internal mechanisms underlying planning behavior in agents, which is especially relevant given the recent trend of emergent planning and reasoning capabilities in LLMs trained with RL.
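To make the methodology concrete, here is a minimal illustrative sketch of step (1), linear probing for a planning-relevant concept, with a schematic of the intervention idea from step (3) at the end. The arrays `hidden_states` and `concept_labels`, their shapes, and the scikit-learn logistic-regression probe are assumptions introduced for illustration, not the paper's actual implementation.

```python
# Illustrative sketch only: `hidden_states` and `concept_labels` are hypothetical
# placeholders for DRC hidden activations (e.g., one vector per grid square) and
# binary labels for a planning-relevant concept such as "the agent will later
# move onto this square".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(10_000, 128))    # placeholder activations
concept_labels = rng.integers(0, 2, size=10_000)  # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(
    hidden_states, concept_labels, test_size=0.2, random_state=0
)

# Step (1): a linear probe. Test accuracy well above chance would suggest the
# concept is linearly decodable from the agent's internal representations.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe test accuracy: {probe.score(X_te, y_te):.3f}")

# Step (3), schematically: the probe's weight vector defines a concept
# direction; an intervention adds a multiple of it to the activations at
# inference time and checks whether the agent's behavior shifts accordingly.
concept_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
intervened_state = hidden_states[0] + 2.0 * concept_direction  # hypothetical edit
```

Note that in such probing setups it is accuracy relative to an appropriate control baseline, rather than raw accuracy alone, that supports the claim that a concept is genuinely represented.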