While unsupervised skill discovery has shown promise in autonomously acquiring behavioral primitives, a large methodological gap remains between task-agnostic skill pretraining and downstream, task-aware finetuning. We present Intrinsic Reward Matching (IRM), which unifies these two phases of learning via the $\textit{skill discriminator}$, a pretraining model component that is often discarded during finetuning. Conventional approaches finetune pretrained agents directly at the policy level and typically rely on expensive environment rollouts to determine the optimal skill empirically. However, the most concise yet complete description of a task is frequently the reward function itself, and skill learning methods learn, via the discriminator, an $\textit{intrinsic}$ reward function that corresponds to the skill policy. We propose to leverage the skill discriminator to $\textit{match}$ the intrinsic and downstream task rewards and thereby determine the optimal skill for an unseen task without environment samples, which makes finetuning more sample-efficient. Furthermore, we generalize IRM to sequence skills for complex, long-horizon tasks and demonstrate that IRM uses pretrained skills far more effectively than previous skill selection methods on both the Fetch tabletop and Franka Kitchen robot manipulation benchmarks.
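To make the idea of reward matching concrete, the following is a minimal sketch of selecting a skill by comparing rewards on sampled states rather than by rolling out policies. Everything in it is an illustrative assumption and not the authors' implementation: the finite set of candidate skills, the toy `intrinsic_reward` and `task_reward` functions, and the use of a Pearson-correlation distance as the matching criterion (chosen here because it ignores positive rescaling and shifting of rewards).

```python
import math
import random


def pearson_distance(a, b):
    """Distance in [0, 1] between two reward vectors.

    Invariant to positive scaling and constant shifts of either vector.
    Assumes neither vector is constant (non-zero variance).
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    rho = cov / math.sqrt(var_a * var_b)
    # Clamp to guard against tiny floating-point overshoot of |rho| > 1.
    return math.sqrt(min(1.0, max(0.0, (1.0 - rho) / 2.0)))


def select_skill(intrinsic_reward, task_reward, skills, states):
    """Return the index of the skill whose intrinsic reward best matches
    the task reward on the given states, plus all distances.

    No environment interaction is needed: both rewards are only evaluated
    on a fixed batch of states.
    """
    task_values = [task_reward(s) for s in states]
    distances = []
    for z in skills:
        skill_values = [intrinsic_reward(s, z) for s in states]
        distances.append(pearson_distance(skill_values, task_values))
    best_index = min(range(len(skills)), key=distances.__getitem__)
    return best_index, distances


# --- Toy setup (hypothetical, for illustration only) ---
rng = random.Random(0)
states = [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for _ in range(512)]

# Four candidate skills, each a direction in a 2-D state space.
skills = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]


def intrinsic_reward(state, skill):
    # Stand-in for a discriminator-derived reward: progress along the skill direction.
    return state[0] * skill[0] + state[1] * skill[1]


def task_reward(state):
    # Downstream task: move "up", with an arbitrary scale and offset.
    return 3.0 * state[1] + 0.5


best_index, distances = select_skill(intrinsic_reward, task_reward, skills, states)
print("selected skill:", skills[best_index])
```

In this toy setting the "up" skill is selected even though the task reward is scaled and shifted relative to the intrinsic reward, which is the property a reward-matching criterion needs. With a learned discriminator and a continuous skill space, the same comparison would have to be optimized over skills rather than enumerated.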