Model-based reinforcement learning (RL) often achieves higher sample efficiency in practice than model-free RL by learning a dynamics model that generates samples for policy learning. Previous works fit the dynamics model to the empirical state-action visitation distribution of all historical policies, i.e., the replay buffer. In this paper, however, we observe that fitting the dynamics model under the distribution of \emph{all historical policies} does not necessarily benefit model prediction for the \emph{current policy}, since the policy in use is constantly evolving. As the policy evolves during training, the state-action visitation distribution shifts. We theoretically analyze how this distribution shift over historical policies affects model learning and model rollouts. We then propose a novel dynamics model learning method, named \textit{Policy-adapted Dynamics Model Learning (PDML)}. PDML dynamically adjusts the historical policy mixture distribution so that the learned model continually adapts to the state-action visitation distribution of the evolving policy. Experiments on a range of continuous control environments in MuJoCo show that PDML, when combined with state-of-the-art model-based RL methods, achieves a significant improvement in sample efficiency and higher asymptotic performance.
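To make the idea of adjusting the historical policy mixture concrete, the following is a minimal sketch of sampling model-training data from a replay buffer under a non-uniform mixture over the policies that collected it. It is not the PDML algorithm itself: the abstract does not specify how the mixture weights are chosen, so the geometric decay rule, the parameter `decay`, and the function names `policy_mixture_weights` and `sample_model_batch` are all assumptions made purely for illustration.

```python
import random
from collections import Counter


def policy_mixture_weights(num_policies, decay=0.5):
    """Return normalized mixture weights over policies, oldest first.

    Hypothetical rule (an assumption, not taken from the paper): the weight
    of policy i is proportional to decay ** age, where age is the number of
    policies collected after it. decay=1.0 recovers the uniform mixture.
    """
    if num_policies <= 0:
        raise ValueError("num_policies must be positive")
    raw = [decay ** (num_policies - 1 - i) for i in range(num_policies)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_model_batch(buffer, batch_size, decay=0.5, rng=None):
    """Sample transitions for model training under the reweighted mixture.

    buffer: list of (policy_index, transition) pairs, where policy_index
    identifies which historical policy collected the transition.
    """
    rng = rng or random.Random()
    num_policies = max(p for p, _ in buffer) + 1
    policy_w = policy_mixture_weights(num_policies, decay)
    counts = Counter(p for p, _ in buffer)
    # Divide each policy's weight evenly among its own samples, so a policy's
    # share of the batch is proportional to its mixture weight rather than to
    # how many transitions it happens to have in the buffer.
    sample_w = [policy_w[p] / counts[p] for p, _ in buffer]
    return rng.choices(buffer, weights=sample_w, k=batch_size)
```

With `decay=1.0` every policy gets the same weight, which gives a uniform mixture over historical policies; smaller values shift model training toward data from recent policies. A method in the spirit of the abstract would adapt these weights during training rather than fix them in advance.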