First-Explore, then Exploit: Meta-Learning Intelligent Exploration

Standard reinforcement learning (RL) agents never intelligently explore like a human (i.e., by taking into account complex domain priors and previous explorations). Even the most basic intelligent exploration strategies, such as exhaustive search, are only inefficiently or poorly approximated by approaches such as novelty search or intrinsic motivation, let alone more complicated strategies like learning new skills, climbing stairs, opening doors, or conducting experiments. This lack of intelligent exploration limits sample efficiency and prevents solving hard exploration domains. We argue that a core barrier preventing many RL approaches from learning intelligent exploration is that the methods attempt to explore and exploit simultaneously, which harms both exploration and exploitation because the two goals often conflict. We propose a novel meta-RL framework (First-Explore) with two policies: one policy learns only to explore, and the other learns only to exploit. Once trained, we can explore with the explore policy for as long as desired, and then exploit based on all the information gained during exploration. This approach avoids the conflict of trying to do both exploration and exploitation at once. We demonstrate that First-Explore can learn intelligent exploration strategies such as exhaustive search and more, and that it outperforms dominant standard RL and meta-RL approaches on domains where exploration requires sacrificing reward. First-Explore is a significant step towards creating meta-RL algorithms capable of learning human-level exploration, which is essential for solving challenging, unseen hard-exploration domains.
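To make the explore-then-exploit structure concrete, the following is a minimal sketch on a multi-armed bandit. It is not the authors' implementation: the two policies here are hand-coded stand-ins (exhaustive search and greedy selection) rather than meta-learned networks, and every name (`explore_policy`, `exploit_policy`, `first_explore_rollout`, `arm_means`, `noise`) is a hypothetical chosen for illustration. The sketch also assumes that the exploit phase conditions only on information gathered during exploration.

```python
import random
from typing import List, Tuple

# Context: the (arm, reward) pairs gathered during exploration.
Context = List[Tuple[int, float]]


def explore_policy(context: Context, n_arms: int) -> int:
    """Hand-coded stand-in for a learned explore policy:
    exhaustive search that pulls the least-tried arm."""
    counts = [0] * n_arms
    for arm, _ in context:
        counts[arm] += 1
    return counts.index(min(counts))


def exploit_policy(context: Context, n_arms: int) -> int:
    """Hand-coded stand-in for a learned exploit policy:
    pick the arm with the best observed mean reward."""
    totals = [0.0] * n_arms
    counts = [0] * n_arms
    for arm, reward in context:
        totals[arm] += reward
        counts[arm] += 1
    means = [
        totals[a] / counts[a] if counts[a] else float("-inf")
        for a in range(n_arms)
    ]
    return means.index(max(means))


def first_explore_rollout(
    arm_means: List[float],
    n_explore: int,
    n_exploit: int,
    noise: float = 0.0,
    seed: int = 0,
) -> Tuple[Context, float]:
    """Explore for n_explore pulls, then exploit for n_exploit pulls.

    Returns the exploration context and the return earned while exploiting.
    Reward earned during exploration is deliberately not counted.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    context: Context = []

    # Phase 1: explore only. Low (even negative) rewards are acceptable here.
    for _ in range(n_explore):
        arm = explore_policy(context, n_arms)
        reward = arm_means[arm] + rng.gauss(0.0, noise)
        context.append((arm, reward))

    # Phase 2: exploit only, using what exploration found.
    exploit_return = 0.0
    for _ in range(n_exploit):
        arm = exploit_policy(context, n_arms)
        exploit_return += arm_means[arm] + rng.gauss(0.0, noise)

    return context, exploit_return


if __name__ == "__main__":
    # Arm 1 has negative reward, so trying it sacrifices reward while exploring.
    context, ret = first_explore_rollout([0.1, -0.5, 0.9, 0.3], 4, 3)
    print("explored arms:", [arm for arm, _ in context])
    print("exploit return:", ret)
```

Keeping the two phases in separate functions mirrors the separation argued for above: the explore stand-in is free to pull an arm with negative reward because only the exploit phase is scored.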
