In environments with sparse rewards, finding a good inductive bias for exploration is crucial to the agent's success. However, there are two competing goals: novelty search and systematic exploration. While existing approaches such as curiosity-driven exploration find novelty, they sometimes fail to explore the whole state space systematically, a contrast akin to depth-first search versus breadth-first search. In this paper, we propose a new intrinsic reward that is cyclophobic, i.e., it does not reward novelty but punishes redundancy by avoiding cycles. By augmenting the cyclophobic intrinsic reward with a sequence of hierarchical representations based on the agent's cropped observations, we achieve excellent results in the MiniGrid and MiniHack environments. Both are particularly hard, as they require complex interactions with different objects in order to be solved. Detailed comparisons with previous approaches and thorough ablation studies show that our newly proposed cyclophobic reinforcement learning is more sample efficient than other state-of-the-art methods in a variety of tasks.
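To make the idea of punishing redundancy rather than rewarding novelty concrete, the following is a minimal illustrative sketch and not the formulation used in the paper. It assumes hashable states and treats any revisit of a state within the same episode as a cycle; the class name `CyclePenalty`, the `penalty` parameter, and the string state labels are all hypothetical.

```python
from typing import Hashable, Set


class CyclePenalty:
    """Toy intrinsic reward (assumed form): penalize revisiting a state within an episode."""

    def __init__(self, penalty: float = -1.0) -> None:
        self.penalty = penalty              # negative value returned when a cycle closes
        self.visited: Set[Hashable] = set()

    def reset(self, initial_state: Hashable) -> None:
        # Start a new episode: only the initial state counts as visited.
        self.visited = {initial_state}

    def __call__(self, next_state: Hashable) -> float:
        # Revisiting a state closes a cycle, so return the penalty.
        if next_state in self.visited:
            return self.penalty
        # First visit: record the state and give no bonus, since novelty is not rewarded.
        self.visited.add(next_state)
        return 0.0


# Usage: the third step returns to "s1" and is the only one penalized.
bonus = CyclePenalty(penalty=-1.0)
bonus.reset("s0")
trajectory = ["s1", "s2", "s1", "s3"]
intrinsic = [bonus(s) for s in trajectory]
```

In this sketch the intrinsic term would be added to the environment reward at each step. Unvisited states receive no bonus, so the agent is only ever pushed away from redundant loops, never pulled toward novelty.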