Animals have a well-developed ability to explore, which aids them in important tasks such as locating food, seeking shelter, and finding misplaced items. These exploration skills necessarily involve tracking where the animal has already been, so that it can plan its search with relative efficiency. Contemporary exploration algorithms often learn a less efficient exploration strategy because they either condition only on the current state or simply rely on random open-loop exploratory moves. In this work, we propose $\eta\psi$-Learning, a method that learns efficient exploratory policies by conditioning on past episodic experience when choosing the next exploratory move. Specifically, $\eta\psi$-Learning learns an exploration policy that maximizes the entropy of the state visitation distribution of a single trajectory. Furthermore, we demonstrate how variants of the predecessor representation and the successor representation can be combined to predict the state visitation entropy. Our experiments demonstrate that $\eta\psi$-Learning explores the environment strategically and maximizes state coverage with limited samples.
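To make the objective concrete, the following minimal sketch computes the entropy of the empirical state visitation distribution of a single trajectory and contrasts a history-conditioned explorer with one that ignores its history. It is an illustration under stated assumptions, not the $\eta\psi$-Learning algorithm itself: the ring-shaped toy environment, the undiscounted visitation counts, and the names `visitation_entropy` and `greedy_entropy_rollout` are all hypothetical, and a one-step greedy lookahead stands in for the learned predecessor and successor representations.

```python
import math
from collections import Counter


def visitation_entropy(trajectory):
    """Shannon entropy (in nats) of the empirical state-visitation
    distribution of one trajectory (undiscounted; an assumption)."""
    counts = Counter(trajectory)
    total = len(trajectory)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def greedy_entropy_rollout(start, n_states, steps):
    """Hypothetical history-conditioned explorer on a ring of n_states states.

    At every step it looks at the whole trajectory so far and picks the
    move (left, stay, right) whose resulting trajectory has the highest
    visitation entropy. Ties go to the first candidate in that order.
    """
    trajectory = [start]
    for _ in range(steps):
        current = trajectory[-1]
        candidates = [(current + a) % n_states for a in (-1, 0, 1)]
        best = max(candidates, key=lambda s: visitation_entropy(trajectory + [s]))
        trajectory.append(best)
    return trajectory


if __name__ == "__main__":
    # History-aware rollout versus a policy that just bounces between two states.
    aware = greedy_entropy_rollout(start=0, n_states=5, steps=4)
    oscillating = [0, 1, 0, 1, 0]
    print("history-aware:", aware, visitation_entropy(aware))
    print("oscillating:  ", oscillating, visitation_entropy(oscillating))
```

In this toy setting, conditioning on the visited states lets the greedy explorer avoid revisits, whereas a policy that sees only the current state has no way to tell a visited neighbour from an unvisited one.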