With the advent of large datasets, offline reinforcement learning (RL) is a promising framework for learning good decision-making policies without the need to interact with the real environment. However, offline RL requires the dataset to be reward-annotated, which presents practical challenges when reward engineering is difficult or when obtaining reward annotations is labor-intensive. In this paper, we introduce Optimal Transport Reward labeling (OTR), an algorithm that assigns rewards to offline trajectories using only a few high-quality demonstrations. OTR's key idea is to use optimal transport to compute an optimal alignment between an unlabeled trajectory in the dataset and an expert demonstration. This alignment yields a similarity measure that can be interpreted as a reward, which an offline RL algorithm can then use to learn a policy. OTR is easy to implement and computationally efficient. On D4RL benchmarks, we show that OTR with a single demonstration can consistently match the performance of offline RL with ground-truth rewards.
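The reward-labeling idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the cosine cost, the entropy-regularized (Sinkhorn) solver, the uniform marginals, the regularization strength `reg`, and the function names `cosine_cost`, `sinkhorn_plan`, and `ot_rewards` are all assumptions made for illustration, since the abstract does not specify them. The sketch computes a transport plan between the states of one unlabeled trajectory and one expert demonstration, then uses each step's negative share of the transport cost as its reward.

```python
import numpy as np


def cosine_cost(x, y, eps=1e-8):
    """Pairwise cosine distance between rows of x (T, d) and rows of y (T', d).

    Assumption: the abstract does not name a cost function; cosine distance
    is used here only as an example.
    """
    x_n = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    y_n = y / (np.linalg.norm(y, axis=1, keepdims=True) + eps)
    return 1.0 - x_n @ y_n.T


def sinkhorn_plan(cost, reg=0.05, n_iters=500):
    """Approximate OT plan between uniform marginals via Sinkhorn iterations.

    Assumption: entropy-regularized OT stands in for whatever solver the
    paper actually uses; `reg` and `n_iters` are illustrative values.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass over trajectory steps
    b = np.full(m, 1.0 / m)  # uniform mass over expert steps
    kernel = np.exp(-cost / reg)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


def ot_rewards(trajectory, expert, reg=0.05):
    """Per-step rewards: negative transport cost attributed to each step."""
    cost = cosine_cost(trajectory, expert)
    plan = sinkhorn_plan(cost, reg=reg)
    return -(plan * cost).sum(axis=1)


# Toy usage with hypothetical 2-D states.
t = np.linspace(0.0, np.pi / 3, 5)
expert_states = np.stack([np.cos(t), np.sin(t)], axis=1)
rewards = ot_rewards(expert_states, expert_states)
print(rewards.shape)  # one reward per step of the unlabeled trajectory
```

In this sketch, a trajectory whose states align closely with the demonstration incurs little transport cost and so receives rewards near zero, while a poorly aligned trajectory receives more negative rewards. The labeled trajectories could then be handed to any offline RL algorithm in place of ground-truth rewards.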