Reward specification is one of the trickiest problems in Reinforcement Learning and usually requires tedious hand engineering in practice. One promising way to tackle this challenge is to leverage existing expert video demonstrations for policy learning. Recent work investigates how to learn robot policies from only a single or a few expert video demonstrations. For example, reward labeling via Optimal Transport (OT) has been shown to be an effective strategy for generating a proxy reward by measuring the alignment between the robot trajectory and the expert demonstrations. However, previous work mostly overlooks that the OT reward is invariant to temporal order information, which can add noise to the reward signal. To address this issue, in this paper we introduce the Temporal Optimal Transport (TemporalOT) reward, which incorporates temporal order information to learn a more accurate OT-based proxy reward. Extensive experiments on Meta-world benchmark tasks validate the efficacy of the proposed method. Code is available at: https://github.com/fuyw/TemporalOT
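To make the motivation concrete, the following is a minimal sketch (not the paper's implementation) of an OT-based proxy reward: a Sinkhorn solver over a cosine-distance cost matrix between agent and expert frame embeddings, with per-timestep rewards given by each agent step's transported cost. The function names, the uniform marginals, and the hyperparameters (`eps`, `n_iters`) are illustrative assumptions. Note that because the marginals are uniform, permuting the expert frames leaves this reward unchanged, which is exactly the temporal-order invariance the paper addresses.

```python
import numpy as np

def sinkhorn(cost, eps=0.05, n_iters=100):
    """Entropic-regularized OT plan via Sinkhorn iterations (uniform marginals)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # uniform mass over agent timesteps
    b = np.full(m, 1.0 / m)   # uniform mass over expert timesteps
    K = np.exp(-cost / eps)   # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]  # transport plan, shape (n, m)

def ot_proxy_reward(agent_traj, expert_traj):
    """Per-timestep proxy reward: negative transported cost for each agent step.

    agent_traj, expert_traj: (T, d) arrays of frame embeddings (hypothetical).
    """
    # Cosine-distance cost matrix between agent and expert embeddings.
    A = agent_traj / np.linalg.norm(agent_traj, axis=1, keepdims=True)
    E = expert_traj / np.linalg.norm(expert_traj, axis=1, keepdims=True)
    cost = 1.0 - A @ E.T
    plan = sinkhorn(cost)
    # Reward for agent step i is the (negated) cost it incurs under the plan.
    return -(plan * cost).sum(axis=1)
```

Shuffling `expert_traj` along the time axis yields the identical reward vector under this construction, illustrating why a vanilla OT reward cannot distinguish a correctly ordered execution from a scrambled one.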