Reinforcement learning (RL) can solve complex sequential decision-making tasks but is currently limited by sample efficiency and computational cost. To improve sample efficiency, recent work focuses on model-based RL, which interleaves model learning with planning. Recent methods further use policy learning, value estimation, and self-supervised learning as auxiliary objectives. In this paper we show that, surprisingly, a simple representation learning approach relying only on a latent dynamics model trained by latent temporal consistency is sufficient for high-performance RL. This holds both for pure planning with a dynamics model conditioned on the representation and for model-free RL that uses the representation as policy and value function features. In experiments, our approach learns an accurate dynamics model that solves challenging high-dimensional locomotion tasks with online planners while training 4.1 times faster than ensemble-based methods. In model-free RL without planning, especially on high-dimensional tasks such as the DeepMind Control Suite Humanoid and Dog tasks, our approach outperforms model-free methods by a large margin and matches the sample efficiency of model-based methods while training 2.4 times faster.
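
To make the idea of latent temporal consistency concrete, the following is a minimal sketch, not the implementation used in this work. It assumes the objective is a discounted cosine-distance loss between latents rolled forward by the dynamics model and latents produced by a separate target encoder from the observations that actually followed. The names `temporal_consistency_loss`, `encoder`, `target_encoder`, `dynamics`, and the discount `rho` are hypothetical, and the toy linear models stand in for learned networks.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def temporal_consistency_loss(encoder, target_encoder, dynamics,
                              observations, actions, rho=0.9):
    """Discounted latent consistency loss over one trajectory segment.

    observations: [o_0, ..., o_H], actions: [a_0, ..., a_{H-1}].
    Assumption: in real training the target latents would be treated as
    constants (no gradient flows through target_encoder).
    """
    assert len(observations) == len(actions) + 1
    z = encoder(observations[0])  # encode only the first observation
    loss = 0.0
    for t, action in enumerate(actions):
        z = dynamics(z, action)  # roll forward purely in latent space
        target = target_encoder(observations[t + 1])
        loss += (rho ** t) * (1.0 - cosine_similarity(z, target))
    return loss


# Toy setup (hypothetical): identity encoders and additive dynamics.
def identity(o):
    return list(o)


def consistent_dynamics(z, a):
    return [zi + ai for zi, ai in zip(z, a)]


def inconsistent_dynamics(z, a):
    return [zi - ai for zi, ai in zip(z, a)]


observations = [[1.0, 0.0], [1.0, 1.0], [2.0, 1.0]]
actions = [[0.0, 1.0], [1.0, 0.0]]

good_loss = temporal_consistency_loss(
    identity, identity, consistent_dynamics, observations, actions)
bad_loss = temporal_consistency_loss(
    identity, identity, inconsistent_dynamics, observations, actions)
```

In this toy example the loss is close to zero when the dynamics model reproduces the latents of the observed next states and grows when its predictions drift away from them. This is the signal that such an objective would use to shape both the representation and the dynamics model.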