This work focuses on sample-efficient deep reinforcement learning (RL) with a simulator. One useful property of simulators is that it is typically easy to reset the environment to a previously observed state. We propose an algorithmic framework, named uncertainty-first local planning (UFLP), that takes advantage of this property. Concretely, in each data collection iteration, with some probability, our meta-algorithm resets the environment to a previously observed state with high uncertainty, instead of sampling a start state from the initial-state distribution. The agent-environment interaction then proceeds as in the standard online RL setting. We demonstrate that this simple procedure can dramatically reduce the sample cost of several baseline RL algorithms on difficult exploration tasks. Notably, with our framework, a simple (distributional) double DQN achieves super-human performance on the notoriously hard Atari game Montezuma's Revenge. Our work can be seen as an efficient approximate implementation of an existing algorithm with theoretical guarantees, which offers an interpretation of the positive empirical results.
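The reset rule described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the toy `ChainSim` simulator, its `reset_to` method, the helper `choose_start_state`, the random-walk policy, and the count-based uncertainty proxy `1 / sqrt(1 + visits)` are all hypothetical, and always picking the single most uncertain state is only one possible way to choose "a state with high uncertainty".

```python
import random
from collections import Counter
from typing import Callable, Optional, Sequence


def choose_start_state(visited: Sequence[int],
                       uncertainty: Callable[[int], float],
                       reset_prob: float,
                       rng: random.Random) -> Optional[int]:
    """Return a visited state to reset to, or None to use the initial-state distribution."""
    if visited and rng.random() < reset_prob:
        # Assumption: take the single most uncertain visited state.
        return max(visited, key=uncertainty)
    return None


class ChainSim:
    """Hypothetical toy simulator: a 1-D chain whose state can be set directly."""

    def __init__(self, length: int = 20) -> None:
        self.length = length
        self.state = 0

    def reset(self) -> int:
        # Initial-state distribution: always start at the left end.
        self.state = 0
        return self.state

    def reset_to(self, state: int) -> int:
        # The simulator property the framework relies on: jump to an observed state.
        self.state = state
        return self.state

    def step(self, action: int) -> int:
        # Move left (-1) or right (+1), clipped to the chain boundaries.
        self.state = min(self.length - 1, max(0, self.state + action))
        return self.state


def collect(iterations: int, horizon: int, reset_prob: float, seed: int = 0) -> Counter:
    """Run data collection with a random policy and return per-state visit counts."""
    rng = random.Random(seed)
    sim = ChainSim()
    counts: Counter = Counter()

    def uncertainty(s: int) -> float:
        # Assumed proxy: rarely visited states are treated as more uncertain.
        return 1.0 / (1.0 + counts[s]) ** 0.5

    for _ in range(iterations):
        start = choose_start_state(list(counts), uncertainty, reset_prob, rng)
        state = sim.reset() if start is None else sim.reset_to(start)
        counts[state] += 1
        # After the (possibly non-standard) reset, interaction proceeds as usual.
        for _ in range(horizon):
            state = sim.step(rng.choice((-1, 1)))
            counts[state] += 1
    return counts
```

Calling `collect(iterations=200, horizon=10, reset_prob=0.5)` and `collect(iterations=200, horizon=10, reset_prob=0.0)` gives visit counts with and without uncertainty-based resets, which can be compared to see how far along the chain each variant explores. In a deep RL setting the count-based proxy would be replaced by whatever uncertainty estimate the base algorithm provides.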