Curiosity has established itself as a powerful exploration strategy in deep reinforcement learning. Notably, leveraging expected future novelty as intrinsic motivation has been shown to efficiently generate both exploratory trajectories and a robust dynamics model. We consider the challenge of extracting goal-conditioned behavior from the products of such unsupervised exploration techniques, without any additional environment interaction. We find that conventional goal-conditioned reinforcement learning approaches for extracting a value function and policy fall short in this difficult offline setting. By analyzing the geometry of optimal goal-conditioned value functions, we relate this issue to a specific class of estimation artifacts in learned values. To mitigate these artifacts, we propose combining model-based planning over learned value landscapes with a graph-based value aggregation scheme. We show how this combination can correct both local and global artifacts, yielding significant improvements in zero-shot goal-reaching performance across diverse simulated environments.
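As a rough illustration of the two ingredients named in the abstract, the sketch below aggregates local cost estimates over a graph and then takes one greedy planning step through a model. It is a minimal toy under stated assumptions, not the method of the paper: the 1-D chain of states, the `noisy_cost` stand-in for a learned (negated) value, the `max_edge_cost` trust threshold, the one-step `model`, and all function names are hypothetical. The premise is that short-range estimates are reliable while long-range ones are distorted, so shortest paths over trusted short edges can replace the distorted long-range estimates.

```python
import heapq


def aggregate_values(states, goal, local_cost, max_edge_cost):
    """Dijkstra from the goal over edges whose estimated cost is trusted.

    local_cost(a, b) is a hypothetical stand-in for a learned estimate of
    the cost of reaching b from a. Edges costlier than max_edge_cost are
    treated as unreliable and dropped (an assumption of this sketch).
    Returns a dict mapping each state to its aggregated cost-to-goal.
    """
    nodes = list(states)
    if goal not in nodes:
        nodes.append(goal)
    dist = {s: float("inf") for s in nodes}
    dist[goal] = 0.0
    frontier = [(0.0, goal)]
    while frontier:
        d, v = heapq.heappop(frontier)
        if d > dist[v]:
            continue  # stale queue entry
        for u in nodes:
            if u == v:
                continue
            c = local_cost(u, v)  # estimated cost of the hop u -> v
            if c > max_edge_cost:
                continue  # long-range estimate, not trusted
            if d + c < dist[u]:
                dist[u] = d + c
                heapq.heappush(frontier, (dist[u], u))
    return dist


def greedy_plan_step(state, actions, model, cost_to_goal):
    """Pick the action whose predicted next state has the lowest cost-to-goal."""
    return min(
        actions,
        key=lambda a: cost_to_goal.get(model(state, a), float("inf")),
    )


# Toy setup: states 0..5 on a line, goal at 5, true cost |a - b|.
def noisy_cost(a, b):
    # Hypothetical artifact: estimates saturate at 1.5 beyond one step.
    gap = abs(a - b)
    return float(gap) if gap <= 1 else 1.5


def model(state, action):
    # Hypothetical one-step dynamics model, clamped to the chain.
    return max(0, min(5, state + action))


states = list(range(6))
cost_to_goal = aggregate_values(states, goal=5, local_cost=noisy_cost, max_edge_cost=1.0)
best_action = greedy_plan_step(2, actions=[-1, 1], model=model, cost_to_goal=cost_to_goal)
```

In this toy, the direct estimate `noisy_cost(0, 5)` is 1.5, whereas the aggregated `cost_to_goal[0]` recovers the true cost of 5.0 by chaining unit-cost edges. Thresholding on the estimate itself only helps when artifacts do not push long-range costs below the threshold, which is an assumption of the sketch rather than a property claimed by the source.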