We investigate the challenge of parameterizing policies for reinforcement learning (RL) in high-dimensional continuous action spaces. Our objective is to develop a multimodal policy that overcomes limitations inherent in the commonly used Gaussian parameterization. To achieve this, we propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories. By conditioning the policy on a latent variable, we derive a novel variational bound as the optimization objective, which promotes exploration of the environment. We then present a practical model-based RL method, called Reparameterized Policy Gradient (RPG), which leverages the multimodal policy parameterization and a learned world model to achieve strong exploration capabilities and high data efficiency. Empirical results demonstrate that our method can help agents evade local optima in tasks with dense rewards and solve challenging sparse-reward environments by incorporating an object-centric intrinsic reward. Our method consistently outperforms previous approaches across a range of tasks. Code and supplementary materials are available on the project page: https://haosulab.github.io/RPG/
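To illustrate why conditioning a policy on a latent variable yields multimodality, the following is a minimal sketch and not the RPG implementation. It assumes a one-dimensional action, a discrete latent `z` with a uniform prior, and two hypothetical action modes at -1 and +1. All names (`sample_action`, `modes`, `std`) are illustrative assumptions. The sketch compares the latent-conditioned samples with the single Gaussian that matches their mean and standard deviation.

```python
import random
import statistics


def sample_action(rng, modes=(-1.0, 1.0), std=0.1):
    """Draw one action from a latent-conditioned policy (illustrative only).

    z is a discrete latent with a uniform prior (an assumption of this sketch);
    given z, the action is Gaussian around the mode selected by z.
    """
    z = rng.randrange(len(modes))      # sample the latent variable
    action = rng.gauss(modes[z], std)  # Gaussian head conditioned on z
    return z, action


rng = random.Random(0)
actions = [sample_action(rng)[1] for _ in range(10_000)]

# Moments of the single Gaussian that would be fitted to the same samples.
mu = statistics.fmean(actions)
sigma = statistics.pstdev(actions)

# Fraction of samples close to either mode, and close to the fitted mean.
near_modes = sum(abs(abs(a) - 1.0) < 0.3 for a in actions) / len(actions)
near_mean = sum(abs(a - mu) < 0.3 for a in actions) / len(actions)

print(f"fitted Gaussian: mean={mu:.3f}, std={sigma:.3f}")
print(f"fraction near a mode: {near_modes:.3f}")
print(f"fraction near the fitted mean: {near_mean:.3f}")
```

In this toy setting, a single Gaussian fitted to the samples centers its density between the two modes, a region the latent-conditioned policy almost never visits. This is the kind of limitation of the Gaussian parameterization that the abstract refers to.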