Exploration bonuses in reinforcement learning guide long-horizon exploration by defining custom intrinsic objectives. Several exploration objectives, such as count-based bonuses, pseudo-counts, and state-entropy maximization, are non-stationary and hence difficult for the agent to optimize. While this issue is generally known, it is usually overlooked, and solutions remain under-explored. The key contribution of our work lies in transforming the original non-stationary rewards into stationary rewards through an augmented state representation. For this purpose, we introduce the Stationary Objectives For Exploration (SOFE) framework. SOFE requires identifying sufficient statistics for different exploration bonuses and finding an efficient encoding of these statistics to use as input to a deep network. The resulting state augmentations expand the state space but hold the promise of simplifying the optimization of the agent's objective. We show that SOFE improves the performance of several exploration objectives, including count-based bonuses, pseudo-counts, and state-entropy maximization. Moreover, SOFE outperforms prior methods that attempt to stabilize the optimization of intrinsic objectives. We demonstrate the efficacy of SOFE in hard-exploration problems, including sparse-reward tasks, pixel-based observations, 3D navigation, and procedurally generated environments.
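To make the idea of a stationary reward through state augmentation concrete, the following is a minimal sketch for the count-based case, not the authors' implementation. It assumes a hypothetical toy gridworld (`CountAugmentedGrid`), a bonus of the form beta / sqrt(N(s)), and a flat tuple of visit counts as the encoding of the sufficient statistics; all of these names and choices are illustrative assumptions. The point it shows is that the bonus changes over time when viewed as a function of the position alone, but is a fixed function of the augmented state (position plus counts).

```python
import math

# Hypothetical action set for a toy gridworld: action id -> (dx, dy).
MOVES = {0: (1, 0), 1: (-1, 0), 2: (0, 1), 3: (0, -1)}


def count_bonus(obs, beta=1.0):
    """Count-based bonus computed from the augmented observation alone.

    obs is (position, flat_counts). Because the counts are part of the
    input, this is a fixed (stationary) function of its argument.
    The form beta / sqrt(N) is an assumed, commonly used bonus.
    """
    (x, y), flat_counts = obs
    size = math.isqrt(len(flat_counts))  # grid is size x size
    n_visits = flat_counts[x * size + y]
    return beta / math.sqrt(n_visits)


class CountAugmentedGrid:
    """Toy gridworld whose observation includes the visit-count table."""

    def __init__(self, size=4, beta=1.0):
        self.size = size
        self.beta = beta
        self.reset()

    def reset(self):
        self.pos = (0, 0)
        self.counts = [[0] * self.size for _ in range(self.size)]
        self.counts[0][0] = 1  # the start cell counts as visited
        return self._observe()

    def _observe(self):
        # Augmented state: raw position plus the flattened count table.
        flat = tuple(c for row in self.counts for c in row)
        return self.pos, flat

    def step(self, action):
        dx, dy = MOVES[action]
        # Clamp the move so the agent stays inside the grid.
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        self.counts[x][y] += 1
        obs = self._observe()
        return obs, count_bonus(obs, self.beta)


if __name__ == "__main__":
    env = CountAugmentedGrid(size=3)
    env.reset()
    obs_a, r_a = env.step(0)  # first visit to cell (1, 0)
    env.step(1)               # step back to (0, 0)
    obs_b, r_b = env.step(0)  # second visit to cell (1, 0)
    print(obs_a[0], r_a)
    print(obs_b[0], r_b)
```

In this sketch the two visits to the same cell yield different bonuses, so a policy or value function conditioned only on the position sees a moving target, while one conditioned on the augmented observation does not. For a deep network, the count table would typically be normalized or encoded more compactly; the flat tuple here is only the simplest possible choice.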