Model-free reinforcement learning (RL) can be used to learn effective policies for complex tasks, such as Atari games, even from image observations. However, this typically requires very large amounts of interaction -- substantially more, in fact, than a human would need to learn the same games. How can people learn so quickly? Part of the answer may be that people can learn how the game works and predict which actions will lead to desirable outcomes. In this paper, we explore how video prediction models can similarly enable agents to solve Atari games with fewer interactions than model-free methods. We describe Simulated Policy Learning (SimPLe), a complete model-based deep RL algorithm based on video prediction models, and present a comparison of several model architectures, including a novel architecture that yields the best results in our setting. Our experiments evaluate SimPLe on a range of Atari games in the low-data regime of 100K interactions between the agent and the environment, which corresponds to two hours of real-time play. In most games, SimPLe outperforms state-of-the-art model-free algorithms, in some games by over an order of magnitude.
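To make the alternation concrete, the following is a minimal sketch of the loop the abstract describes: collect a small batch of real interactions with the current policy, fit a video prediction model to the accumulated data, then train the policy entirely inside the learned model. Every class (`ToyEnv`, `WorldModel`, `Policy`), constant, and the toy dynamics below are illustrative placeholders under our own assumptions, not the paper's actual implementation.

```python
# Sketch of the SimPLe-style alternation: real data collection,
# world-model training, and policy training in imagination.
# All components here are hypothetical stand-ins.

import random

class ToyEnv:
    """Stand-in for an Atari environment (frames are just integers here)."""
    def reset(self):
        return 0
    def step(self, action):
        next_frame = random.randrange(256)
        reward = float(action == next_frame % 4)
        return next_frame, reward

class WorldModel:
    """Placeholder for the video prediction model."""
    def fit(self, transitions):
        pass  # next-frame and reward prediction training would go here
    def step(self, frame, action):
        # Predicted next frame and reward; random stand-ins here.
        return random.randrange(256), random.random()

class Policy:
    """Placeholder policy over a small discrete action set."""
    def act(self, frame):
        return random.randrange(4)
    def update(self, rollouts):
        pass  # policy optimization on imagined trajectories would go here

def simple_loop(total_real_steps=100_000, steps_per_iter=6_400,
                model_rollout_len=50, rollouts_per_iter=16):
    env, model, policy = ToyEnv(), WorldModel(), Policy()
    real_steps, data = 0, []
    while real_steps < total_real_steps:
        # (1) Collect real experience with the current policy.
        frame = env.reset()
        for _ in range(steps_per_iter):
            action = policy.act(frame)
            next_frame, reward = env.step(action)
            data.append((frame, action, reward, next_frame))
            frame, real_steps = next_frame, real_steps + 1
        # (2) Train the world model on all real data gathered so far.
        model.fit(data)
        # (3) Train the policy on imagined rollouts, at no real-env cost.
        rollouts = []
        for _ in range(rollouts_per_iter):
            sim_frame = random.choice(data)[0]  # start from an observed frame
            rollout = []
            for _ in range(model_rollout_len):
                action = policy.act(sim_frame)
                sim_frame, reward = model.step(sim_frame, action)
                rollout.append((action, reward))
            rollouts.append(rollout)
        policy.update(rollouts)
    return policy

if __name__ == "__main__":
    simple_loop(total_real_steps=200, steps_per_iter=100)  # tiny demo run
```

The design point the sketch illustrates is the sample-efficiency argument of the abstract: the real interaction budget (100K steps) is consumed only in step (1), while the policy can be trained on an arbitrarily large amount of experience generated by the learned model in step (3).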