This work focuses on sample-efficient deep reinforcement learning (RL) with a simulator. A useful property of simulators is that it is typically easy to reset the environment to a previously observed state. We propose an algorithmic framework, named uncertainty-first local planning (UFLP), that takes advantage of this property. Concretely, in each data collection iteration, with some probability, our meta-algorithm resets the environment to a previously observed state with high uncertainty, instead of sampling a start state from the initial-state distribution. The agent-environment interaction then proceeds as in the standard online RL setting. We demonstrate that this simple procedure can dramatically reduce the sample cost of several baseline RL algorithms on difficult exploration tasks. Notably, with our framework, we achieve superhuman performance on the notoriously hard Atari game Montezuma's Revenge with a simple (distributional) double DQN. Our work can be seen as an efficient approximate implementation of an existing algorithm with theoretical guarantees, which offers an interpretation of the positive empirical results.
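The data-collection step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy `ChainSim` simulator, its `reset_to` method, the count-based uncertainty score, the random behavior policy, and the choice of the single most uncertain state (rather than sampling among high-uncertainty states) are all hypothetical stand-ins chosen for brevity.

```python
import math
import random
from collections import Counter


def choose_start_state(visited, uncertainty, reset_prob, rng):
    """Return an observed state to reset to, or None to use the initial-state distribution.

    Assumption: "high uncertainty" is taken here as the argmax of a score function.
    """
    if not visited or rng.random() >= reset_prob:
        return None
    return max(visited, key=uncertainty)


class ChainSim:
    """Hypothetical toy simulator: integer states 0..n-1 with resets to any state."""

    def __init__(self, n=10):
        self.n = n
        self.state = 0

    def reset(self):
        # The initial-state distribution is a point mass on state 0.
        self.state = 0
        return self.state

    def reset_to(self, state):
        # Simulator-only capability: jump back to a previously observed state.
        self.state = state
        return self.state

    def step(self, action):
        # action is -1 or +1; the state is clipped to the chain boundaries.
        self.state = min(self.n - 1, max(0, self.state + action))
        return self.state


def collect_episode(sim, visit_counts, reset_prob, horizon, rng):
    """Collect one trajectory, optionally starting from an uncertain observed state."""

    def count_uncertainty(state):
        # Assumed proxy: rarely visited states are treated as more uncertain.
        return 1.0 / math.sqrt(1.0 + visit_counts[state])

    start = choose_start_state(list(visit_counts), count_uncertainty, reset_prob, rng)
    state = sim.reset() if start is None else sim.reset_to(start)
    trajectory = [state]
    visit_counts[state] += 1
    for _ in range(horizon):
        # A uniformly random policy stands in for the baseline RL agent.
        state = sim.step(rng.choice((-1, 1)))
        visit_counts[state] += 1
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    rng = random.Random(0)
    sim = ChainSim(n=10)
    counts = Counter()
    for _ in range(20):
        collect_episode(sim, counts, reset_prob=0.5, horizon=5, rng=rng)
    print(sorted(counts.items()))
```

Setting `reset_prob=0` in this sketch recovers ordinary online data collection from the initial-state distribution, so the reset rule acts as a wrapper around whichever baseline algorithm consumes the trajectories.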