Exploration bonuses in reinforcement learning guide long-horizon exploration by defining custom intrinsic objectives. Count-based methods use the frequency of state visits to derive an exploration bonus. In this paper, we identify that any intrinsic reward function derived from count-based methods is non-stationary and hence induces an objective that is difficult for the agent to optimize. The key contribution of our work lies in transforming the original non-stationary rewards into stationary rewards through an augmented state representation. For this purpose, we introduce the Stationary Objectives For Exploration (SOFE) framework. SOFE requires identifying sufficient statistics for different exploration bonuses and finding an efficient encoding of these statistics to use as input to a deep network. SOFE proposes state augmentations that expand the state space but hold the promise of simplifying the optimization of the agent's objective. Our experiments show that SOFE improves the agents' performance in challenging exploration problems, including sparse-reward tasks, pixel-based observations, 3D navigation, and procedurally generated environments.
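To make the stationarity argument concrete, the following is a minimal tabular sketch, not the implementation used in the paper. It assumes a small discrete state space, a bonus of the form 1/sqrt(N(s) + 1), and an augmentation that simply appends the raw visit counts to a one-hot state; the function names `count_bonus`, `augment`, `reward_from_obs`, and `rollout` are hypothetical. The bonus for a given state changes over time when viewed as a function of the state alone, but it is a fixed function of the augmented observation.

```python
import math

def count_bonus(state, counts):
    """Bonus for arriving in `state`, given visit counts before this visit.

    The 1/sqrt(N + 1) form is an illustrative assumption. It is a pure
    function of (state, counts): identical inputs give identical rewards.
    """
    return 1.0 / math.sqrt(counts[state] + 1)

def augment(state, counts):
    """Hypothetical encoding: one-hot state followed by the raw visit counts."""
    one_hot = [1.0 if s == state else 0.0 for s in range(len(counts))]
    return one_hot + [float(c) for c in counts]

def reward_from_obs(obs, n_states):
    """Recover the bonus from the augmented observation alone."""
    state = obs[:n_states].index(1.0)
    visits = obs[n_states + state]
    return 1.0 / math.sqrt(visits + 1)

def rollout(states, n_states):
    """Replay a fixed sequence of visited states and record (state, bonus, obs)."""
    counts = [0] * n_states
    trace = []
    for s in states:
        trace.append((s, count_bonus(s, counts), augment(s, counts)))
        counts[s] += 1  # the count table is updated after the reward is computed
    return trace

trace = rollout([0, 1, 0, 0], n_states=3)
for state, bonus, obs in trace:
    # State 0 receives a different bonus on each visit (non-stationary in the
    # state alone), yet every bonus matches the one recomputed from obs.
    assert bonus == reward_from_obs(obs, 3)
```

In this sketch the count table plays the role of the sufficient statistic. With pixel observations or large state spaces, a raw count vector would not be practical, which is where an efficient encoding of the statistics becomes necessary.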