We study zero-shot generalization in reinforcement learning: optimizing a policy on a set of training tasks such that it performs well on a similar but unseen test task. To mitigate overfitting, previous work explored different notions of invariance to the task. However, on problems such as the ProcGen Maze, an adequate solution that is invariant to the task visualization does not exist, and therefore invariance-based approaches fail. Our insight is that a policy that $\textit{explores}$ the domain effectively is harder to memorize than a policy that maximizes reward for a specific task, and we therefore expect such learned behavior to generalize well; we demonstrate this empirically on several domains that are difficult for invariance-based approaches. Our $\textit{Explore to Generalize}$ algorithm (ExpGen) builds on this insight: we additionally train an ensemble of agents that optimize reward. At test time, either the ensemble agrees on an action, and we generalize well, or we take exploratory actions, which are guaranteed to generalize and drive us to a novel part of the state space, where the ensemble may potentially agree again. We show that our approach achieves state-of-the-art results on several tasks in the ProcGen challenge that have so far eluded effective generalization. For example, we demonstrate a success rate of $82\%$ on the Maze task and $74\%$ on Heist with $200$ training levels.
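The test-time rule described in the abstract (act on the ensemble's choice when it agrees, otherwise explore) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_action`, the `min_votes` threshold, and the use of a simple vote count as the agreement test are all assumptions made here for illustration, since the abstract does not specify how agreement is measured.

```python
from collections import Counter
from typing import Any, Callable, Sequence, Tuple

Action = int
Policy = Callable[[Any], Action]  # hypothetical interface: observation -> action


def select_action(
    obs: Any,
    reward_policies: Sequence[Policy],
    explore_policy: Policy,
    min_votes: int,
) -> Tuple[Action, bool]:
    """Return (action, explored) for one test-time step.

    Assumption: "agreement" means that at least `min_votes` ensemble
    members propose the same action. The abstract does not fix this rule.
    """
    if not reward_policies:
        raise ValueError("the ensemble must contain at least one policy")

    # Count how many reward-maximizing agents propose each action.
    votes = Counter(policy(obs) for policy in reward_policies)
    action, count = votes.most_common(1)[0]

    if count >= min_votes:
        # The ensemble agrees: follow the reward-maximizing action.
        return action, False

    # No agreement: fall back to the exploration policy.
    return explore_policy(obs), True


# Toy usage with constant policies standing in for trained agents.
ensemble = [lambda obs: 0, lambda obs: 0, lambda obs: 1]
explore = lambda obs: 3

agreed = select_action("state", ensemble, explore, min_votes=2)
disagreed = select_action("state", ensemble, explore, min_votes=3)
```

With the toy ensemble above, two of the three agents vote for action `0`, so a threshold of 2 follows the ensemble, while a threshold of 3 (unanimity) triggers the exploratory action instead.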