Model-based reinforcement learning (MBRL) with real-time planning has shown great potential in locomotion and manipulation control tasks. However, existing planning methods, such as the Cross-Entropy Method (CEM), do not scale well to complex high-dimensional environments. A key reason for this underperformance is insufficient exploration: these planning methods aim only to maximize the cumulative extrinsic reward over the planning horizon. Furthermore, planning inside a compact latent space, where no observations are available, makes it challenging to use curiosity-based intrinsic motivation. We propose Curiosity CEM (CCEM), an improved version of the CEM algorithm that encourages exploration via curiosity. Our method maximizes the sum of state-action Q-values over the planning horizon, where each Q-value estimates the future extrinsic and intrinsic reward, which encourages the agent to reach novel observations. In addition, our model uses contrastive representation learning to learn latent representations efficiently. Experiments on image-based continuous control tasks from the DeepMind Control suite show that CCEM is more sample-efficient than previous MBRL algorithms by a large margin and compares favorably with the best model-free RL methods.
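As an illustration of the planning objective described in the abstract, the following is a minimal sketch of a CEM planner that scores each candidate action sequence by the sum of Q-values along a latent rollout, rather than by the sum of predicted rewards. It is not the authors' implementation. The functions `cem_plan`, `toy_dynamics`, and `toy_q`, and all hyperparameter values, are hypothetical and chosen only for this example; in CCEM the Q-function would be a learned critic estimating future extrinsic plus intrinsic reward, and the dynamics would be a learned latent model.

```python
import numpy as np


def cem_plan(z0, dynamics, q_value, horizon=5, action_dim=1,
             pop_size=64, n_elites=8, n_iters=5, seed=0):
    """CEM over action sequences, scored by the sum of Q-values on a latent rollout."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences and clip them to [-1, 1].
        noise = rng.standard_normal((pop_size, horizon, action_dim))
        actions = np.clip(mean + std * noise, -1.0, 1.0)
        # Roll every candidate forward in latent space, accumulating Q(z_t, a_t).
        scores = np.zeros(pop_size)
        z = np.repeat(z0[None, :], pop_size, axis=0)
        for t in range(horizon):
            scores += q_value(z, actions[:, t])
            z = dynamics(z, actions[:, t])
        # Refit the sampling distribution to the highest-scoring sequences.
        elites = actions[np.argsort(scores)[-n_elites:]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6
    # Return only the first action of the refined plan (receding-horizon control).
    return mean[0]


def toy_dynamics(z, a):
    # Hypothetical stand-in for a learned latent dynamics model.
    return z + 0.1 * a


def toy_q(z, a):
    # Hypothetical stand-in for a learned critic: prefers next latents near 1.0.
    return -np.sum((z + 0.1 * a - 1.0) ** 2, axis=-1)


action = cem_plan(np.zeros(1), toy_dynamics, toy_q)
```

In this toy setting the latent starts at 0 and the stand-in critic rewards moving toward 1, so the planner should return a positive first action. Swapping `toy_q` for a critic trained on extrinsic plus intrinsic reward is what would bias the search toward novel states.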