Hindsight goal relabeling has become a foundational technique in multi-goal reinforcement learning (RL). The essential idea is that any trajectory can be seen as a sub-optimal demonstration for reaching its final state. Intuitively, learning from those arbitrary demonstrations can be seen as a form of imitation learning (IL). However, the connection between hindsight goal relabeling and imitation learning is not well understood. In this paper, we propose a novel framework to understand hindsight goal relabeling from a divergence minimization perspective. Recasting the goal reaching problem in the IL framework not only allows us to derive several existing methods from first principles, but also provides us with the tools from IL to improve goal reaching algorithms. Experimentally, we find that under hindsight relabeling, Q-learning outperforms behavioral cloning (BC). Yet, a vanilla combination of both hurts performance. Concretely, we see that the BC loss only helps when selectively applied to actions that get the agent closer to the goal according to the Q-function. Our framework also explains the puzzling phenomenon wherein a reward of (-1, 0) results in significantly better performance than a (0, 1) reward for goal reaching.
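To make the relabeling idea concrete, the following is a minimal sketch and not the paper's implementation. It relabels every transition of a trajectory with the trajectory's final state as the goal and assigns the (-1, 0) reward mentioned above: -1 until the goal is reached, 0 once it is. The function names `relabel_with_final_state` and `selective_bc_mask`, the tolerance `tol`, and the masking rule (keep a dataset action only when its Q-value is at least that of the policy's own action) are illustrative assumptions. The abstract states only that the BC loss is applied to actions that bring the agent closer to the goal according to the Q-function.

```python
from typing import List, Sequence, Tuple

# (state, action, next_state, goal, reward); a hypothetical layout for illustration.
Transition = Tuple[Sequence[float], Sequence[float], Sequence[float], Sequence[float], float]


def relabel_with_final_state(
    states: Sequence[Sequence[float]],
    actions: Sequence[Sequence[float]],
    tol: float = 1e-6,
) -> List[Transition]:
    """Relabel a trajectory so that its final state becomes the goal.

    Uses the (-1, 0) reward convention: -1 while the next state differs
    from the goal, 0 once the next state matches it within `tol`.
    """
    if len(states) != len(actions) + 1:
        raise ValueError("expected len(states) == len(actions) + 1")
    goal = states[-1]
    relabeled: List[Transition] = []
    for t, action in enumerate(actions):
        next_state = states[t + 1]
        # Max-norm distance to the relabeled goal (an assumed success test).
        dist = max(abs(a - b) for a, b in zip(next_state, goal))
        reward = 0.0 if dist <= tol else -1.0
        relabeled.append((states[t], action, next_state, goal, reward))
    return relabeled


def selective_bc_mask(q_data: Sequence[float], q_policy: Sequence[float]) -> List[bool]:
    """Assumed masking rule: apply the BC loss only where the dataset action
    is valued at least as highly as the current policy's action."""
    return [qd >= qp for qd, qp in zip(q_data, q_policy)]


if __name__ == "__main__":
    trajectory_states = [[0.0], [1.0], [2.0], [3.0]]
    trajectory_actions = [[1.0], [1.0], [1.0]]
    for transition in relabel_with_final_state(trajectory_states, trajectory_actions):
        print(transition)
    print(selective_bc_mask([0.0, -2.0], [-1.0, -1.0]))
```

In this toy trajectory only the last transition reaches the relabeled goal, so it alone receives reward 0. The mask would then be multiplied into a per-sample BC loss, leaving the Q-learning loss untouched.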