Many imitation learning (IL) algorithms employ inverse reinforcement learning (IRL) to infer, from demonstrated behaviors, the intrinsic reward function that an expert is implicitly optimizing. However, in practice, IRL-based IL can fail to accomplish the underlying task because the inferred reward is misaligned with the task objective. In this paper, we address the susceptibility of IL to such misalignment by introducing a semi-supervised reward design paradigm called Protagonist Antagonist Guided Adversarial Reward (PAGAR). PAGAR-based IL trains a policy to perform well under mixed reward functions rather than under a single reward function, as in IRL-based IL. We identify the theoretical conditions under which PAGAR-based IL can avoid the task failures caused by reward misalignment. We also present a practical on-and-off-policy approach to implementing PAGAR-based IL. Experimental results show that our algorithm outperforms standard IL baselines in complex tasks and challenging transfer settings.