Recent reinforcement learning (RL) methods have substantially enhanced the planning capabilities of Large Language Models (LLMs), yet the theoretical basis for their effectiveness remains elusive. In this work, we investigate RL's benefits and limitations through a tractable graph-based abstraction, focusing on policy gradient (PG) and Q-learning methods. Our theoretical analyses reveal that supervised fine-tuning (SFT) may introduce co-occurrence-based spurious solutions, whereas RL achieves correct planning primarily through exploration, underscoring exploration's role in enabling better generalization. However, we also show that PG suffers from diversity collapse: output diversity decreases during training, and the collapse persists even after perfect accuracy is attained. By contrast, Q-learning provides two key advantages: off-policy learning and diversity preservation at convergence. We further demonstrate that careful reward design is necessary to prevent Q-value bias in Q-learning. Finally, applying our framework to the real-world planning benchmark Blocksworld, we confirm that these behaviors manifest in practice.
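As a minimal sketch of the contrast the abstract describes, the following toy example (our own illustration, not the paper's graph environment) uses a two-armed bandit in which both actions are always rewarded, standing in for two equally valid plans. A softmax REINFORCE learner without a baseline drifts toward a single action (diversity collapse even though accuracy is already perfect), while tabular Q-learning with an epsilon-greedy behaviour policy drives both Q-values to the same value, so a softmax over Q keeps the two plans equally likely. The learning rates, step counts, and epsilon value are arbitrary choices for the sketch.

```python
import numpy as np

# Toy two-armed bandit: BOTH actions are correct (reward 1).
rng = np.random.default_rng(0)
n_steps = 20_000

# --- REINFORCE with a softmax policy and no baseline ---
logits = np.zeros(2)
lr_pg = 0.1
for _ in range(n_steps):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = rng.choice(2, p=probs)
    reward = 1.0
    grad = -probs                  # d log pi(a) / d logits = one_hot(a) - probs
    grad[a] += 1.0
    logits += lr_pg * reward * grad
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print("PG policy:", probs)         # probability mass drifts onto a single action

# --- Tabular Q-learning with an epsilon-greedy behaviour policy ---
Q = np.zeros(2)
lr_q, eps = 0.1, 0.2
for _ in range(n_steps):
    a = int(rng.integers(2)) if rng.random() < eps else int(Q.argmax())
    Q[a] += lr_q * (1.0 - Q[a])    # one-step bandit target, reward 1
print("Q-values:", Q)              # both converge to 1
print("softmax(Q):", np.exp(Q) / np.exp(Q).sum())  # stays ~uniform, diversity preserved
```

The drift in the PG case is not an artifact of the seed: with equal rewards, the expected change of the logit gap is positive whenever one action already has more than half the probability, so the uniform policy is an unstable equilibrium.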