Goal-Conditioned Q-Learning as Knowledge Distillation

Many applications of reinforcement learning can be formalized as goal-conditioned environments, where, in each episode, there is a "goal" that affects the rewards obtained during that episode but does not affect the dynamics. Various techniques have been proposed to improve performance in goal-conditioned environments, such as automatic curriculum generation and goal relabeling. In this work, we explore a connection between off-policy reinforcement learning in goal-conditioned settings and knowledge distillation. Specifically, the current Q-value function and the target Q-value estimate are both functions of the goal, and we would like to train the Q-value function to match its target for all goals. We therefore apply Gradient-Based Attention Transfer (Zagoruyko and Komodakis 2017), a knowledge distillation technique, to the Q-function update. We show empirically that this can improve the performance of goal-conditioned off-policy reinforcement learning when the space of goals is high-dimensional. We also show that the technique can be adapted to allow efficient learning with multiple simultaneous sparse goals, where the agent can attain a reward by achieving any one of a large set of objectives, all specified at test time. Finally, to provide theoretical support, we give examples of classes of environments where (under some assumptions) standard off-policy algorithms such as DDPG require at least O(d^2) replay buffer transitions to learn an optimal policy, while our proposed technique requires only O(d) transitions, where d is the dimensionality of the goal and state spaces. Code is available at https://github.com/alevine0/ReenGAGE.
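To make the distillation view concrete, the following is a minimal sketch of what a gradient-matched Q-update loss might look like. It is not taken from the released code. It assumes that the usual squared error between the Q-value and its target is augmented with a penalty on the mismatch between their gradients with respect to the goal. The function names, the weight `alpha`, the L1 penalty, and the use of finite differences are all illustrative assumptions; a practical implementation would use automatic differentiation and treat the target as fixed when updating the Q-network.

```python
from typing import Callable, List, Sequence

GoalFn = Callable[[Sequence[float]], float]


def goal_gradient(f: GoalFn, goal: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Central finite-difference gradient of f with respect to the goal vector."""
    grad = []
    for i in range(len(goal)):
        up = list(goal)
        down = list(goal)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def gradient_matched_loss(q: GoalFn, target: GoalFn,
                          goal: Sequence[float], alpha: float = 1.0) -> float:
    """Squared value error plus an L1 penalty on the goal-gradient mismatch.

    `q` and `target` stand in for the Q-value and its target estimate at one
    fixed (state, action) pair, viewed as functions of the goal only.
    """
    value_term = (q(goal) - target(goal)) ** 2
    grad_q = goal_gradient(q, goal)
    grad_t = goal_gradient(target, goal)
    gradient_term = sum(abs(a - b) for a, b in zip(grad_q, grad_t))
    return value_term + alpha * gradient_term


# Hypothetical linear-in-goal stand-ins for the Q-function and its target.
def make_linear(weights: Sequence[float]) -> GoalFn:
    return lambda g: sum(w * x for w, x in zip(weights, g))


q_fn = make_linear([1.0, 2.0])
target_fn = make_linear([1.5, 0.0])

# At the zero goal both functions return 0, so the value term alone is blind
# to the disagreement; the gradient term still exposes it.
value_only = gradient_matched_loss(q_fn, target_fn, [0.0, 0.0], alpha=0.0)
with_gradients = gradient_matched_loss(q_fn, target_fn, [0.0, 0.0], alpha=1.0)
```

In this toy case a single goal sample carries one scalar constraint through the value term but one constraint per goal dimension through the gradient term. This gives some intuition, under the stated assumptions, for why matching gradients could reduce the number of transitions needed when the goal space is high-dimensional.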
