Imitation Learning (IL) is a widely used framework for learning imitative behavior from demonstrations. It is especially appealing for complex real-world tasks where handcrafting a reward function is difficult, or where the goal is to mimic human expert behavior. However, a learned imitative policy can only reproduce the behavior in the demonstrations. When deploying such a policy, we may need to customize its behavior to meet the differing requirements of diverse downstream tasks, while still preserving its imitative nature. To this end, we formulate a new problem setting called policy customization. It defines the learning task as training a policy that inherits the characteristics of the prior policy while satisfying additional requirements imposed by a target downstream task. We propose a novel and principled approach to interpret and determine the trade-off between the two task objectives. Specifically, we formulate the customization problem as a Markov Decision Process (MDP) with a reward function that combines 1) the inherent reward of the demonstration and 2) the add-on reward specified by the downstream task. We propose a novel framework, Residual Q-learning, which solves the formulated MDP by leveraging the prior policy, without knowing the inherent reward or value function of the prior policy. We derive a family of residual Q-learning algorithms that realize offline and online policy customization, and show that the proposed algorithms effectively accomplish policy customization tasks in various environments.
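To make the combined-reward formulation concrete, the following is a minimal sketch rather than the formulation stated in the abstract. It assumes a maximum-entropy RL setting with discrete actions, a weighted-sum combination of the two rewards, and a prior policy that is soft-optimal for its inherent reward. All symbols ($\omega$, $\alpha$, $\hat\alpha$, $r$, $r_R$, $Q_R$) are notation introduced here for illustration. Under these assumptions, the difference between the customized soft Q-function and the scaled prior soft Q-function satisfies a recursion that involves only the add-on reward and the prior policy's log-probabilities.

```latex
% Assumed notation (not given in the abstract):
%   r        : unknown inherent reward behind the demonstrations
%   r_R      : add-on reward from the downstream task
%   \omega   : trade-off weight between the two objectives
%   \alpha, \hat\alpha : entropy temperatures of the prior and customized policies
\begin{align}
  \hat r(s,a) &= \omega\, r(s,a) + r_R(s,a), \\
  \pi(a \mid s) &= \exp\!\big( (Q^*(s,a) - V^*(s)) / \alpha \big), \\
  Q_R(s,a) &:= \hat Q^*(s,a) - \omega\, Q^*(s,a), \\
  Q_R(s,a) &= r_R(s,a) + \gamma\, \mathbb{E}_{s'}\!\left[ \hat\alpha \log \sum_{a'}
      \exp\!\left( \frac{Q_R(s',a') + \omega\alpha \log \pi(a' \mid s')}{\hat\alpha} \right) \right], \\
  \hat\pi(a \mid s) &\propto \exp\!\left( \frac{Q_R(s,a) + \omega\alpha \log \pi(a \mid s)}{\hat\alpha} \right).
\end{align}
```

The fourth line follows by substituting $Q^*(s',a') = \alpha \log \pi(a' \mid s') + V^*(s')$ into the soft Bellman equation for $\hat Q^*$, after which the $\omega V^*(s')$ terms cancel. Neither $r$ nor $Q^*$ appears in the resulting recursion, which is consistent with the claim that the prior policy alone suffices.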