Visual model-based reinforcement learning (RL) has the potential to enable sample-efficient robot learning from visual observations. Yet current approaches typically train a single model end-to-end to learn both visual representations and dynamics, which makes it difficult to accurately model the interaction between robots and small objects. In this work, we introduce a visual model-based RL framework that decouples visual representation learning from dynamics learning. Specifically, we train an autoencoder with convolutional layers and vision transformers (ViT) to reconstruct pixels given masked convolutional features, and learn a latent dynamics model that operates on the representations from the autoencoder. Moreover, to encode task-relevant information, we introduce an auxiliary reward prediction objective for the autoencoder. We continually update both the autoencoder and the dynamics model using online samples collected from environment interaction. We demonstrate that our decoupling approach achieves state-of-the-art performance on a variety of visual robotic tasks from Meta-world and RLBench. For example, it achieves an 81.7% success rate on 50 visual robotic manipulation tasks from Meta-world, whereas the baseline achieves 67.9%. Code is available on the project website: https://sites.google.com/view/mwm-rl.
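The autoencoder objective described above (hide a subset of convolutional feature tokens, reconstruct pixels, and additionally predict reward) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the 0.75 mask ratio, the use of squared error for both terms, and the `reward_weight` parameter are all assumptions made for illustration, and the ViT encoder and decoder themselves are omitted.

```python
import random


def split_masked_tokens(num_tokens, mask_ratio, rng):
    """Randomly pick which convolutional feature tokens are hidden.

    Returns (visible, masked) index lists. In a masked autoencoder, only the
    visible tokens would be passed to the ViT encoder.
    """
    num_masked = int(num_tokens * mask_ratio)
    order = list(range(num_tokens))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def mean_squared_error(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    assert len(pred) == len(target)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def autoencoder_loss(pixels, recon_pixels, reward, reward_pred, reward_weight=1.0):
    """Pixel reconstruction loss plus an auxiliary reward prediction loss.

    The squared-error form and the reward_weight knob are assumptions.
    """
    recon_term = mean_squared_error(recon_pixels, pixels)
    reward_term = (reward_pred - reward) ** 2
    return recon_term + reward_weight * reward_term


# Toy usage: a 4x4 grid of feature tokens with an assumed mask ratio of 0.75.
rng = random.Random(0)
visible, masked = split_masked_tokens(num_tokens=16, mask_ratio=0.75, rng=rng)

# Toy "image" of two pixel values with a hypothetical reconstruction and reward.
loss = autoencoder_loss(
    pixels=[0.0, 1.0],
    recon_pixels=[0.5, 1.0],
    reward=1.0,
    reward_pred=0.0,
)
```

In this sketch the mask is applied to feature tokens rather than raw pixel patches, mirroring the statement that reconstruction is conditioned on masked convolutional features. The latent dynamics model would be trained separately on the autoencoder's representations.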