Model-based reinforcement learning (RL) has achieved remarkable success on a range of continuous control tasks owing to its high sample efficiency. To avoid the computational cost of planning online, recent methods distill optimized action sequences into an RL policy during training. Although such distillation combines the foresight of planning with the exploration ability of RL policies, the theoretical understanding of these methods remains unclear. In this paper, we extend the policy improvement step of Soft Actor-Critic (SAC) by developing an approach that distills model-based planning into the policy. We then show that this form of policy improvement has a theoretical guarantee of monotonic improvement and of convergence to the maximum value defined in SAC. We discuss effective design choices and implement our theory as a practical algorithm, Model-based Planning Distilled to Policy (MPDP), that updates the policy jointly over multiple future time steps. Extensive experiments show that MPDP achieves better sample efficiency and asymptotic performance than both model-free and model-based planning algorithms on six continuous control benchmark tasks in MuJoCo.
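For reference, the policy improvement step of standard SAC, which the abstract says is being extended, can be written as below. This is the original SAC formulation and not the MPDP update, whose exact form the abstract does not give. The symbols are the usual SAC ones and are assumptions with respect to this text: $\Pi$ is the policy class, $Q^{\pi_{\text{old}}}$ the soft Q-function of the current policy, $\alpha$ the entropy temperature (sometimes absorbed into the reward scale), and $Z^{\pi_{\text{old}}}(s_t)$ a normalizing term that does not affect the minimizer.

```latex
\pi_{\text{new}}
  = \arg\min_{\pi' \in \Pi}
    D_{\mathrm{KL}}\!\left(
      \pi'(\cdot \mid s_t)
      \;\Big\|\;
      \frac{\exp\!\left(\tfrac{1}{\alpha}\, Q^{\pi_{\text{old}}}(s_t, \cdot)\right)}
           {Z^{\pi_{\text{old}}}(s_t)}
    \right)
```

In standard SAC this projection is taken over a single action at state $s_t$; the abstract indicates that MPDP instead updates the policy jointly over multiple future time steps.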