Offline reinforcement learning (RL) addresses sequential decision-making by learning an optimal policy from pre-collected data, without interacting with the environment. So far it has remained somewhat impractical, because the reward is rarely known explicitly and is hard to recover retrospectively. Here, we show that an imitating agent can still learn the desired behavior merely from observing the expert, despite the absence of explicit rewards or action labels. In our method, AILOT (Aligned Imitation Learning via Optimal Transport), we use a special representation of states in the form of intents that incorporate pairwise spatial distances within the data. Given such representations, we define an intrinsic reward function via the optimal transport distance between the expert's and the agent's trajectories. We report that AILOT outperforms state-of-the-art offline imitation learning algorithms on D4RL benchmarks and improves the performance of other offline RL algorithms on sparse-reward tasks.
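To make the reward construction concrete, the following is a minimal sketch of how per-state intrinsic rewards could be derived from an optimal transport plan between an agent trajectory and an expert trajectory. It is an illustration under stated assumptions, not the AILOT implementation: the intent embeddings are taken as given arrays, and the function names (`sinkhorn_plan`, `ot_rewards`), the entropy-regularized Sinkhorn solver, the squared Euclidean cost, the uniform weights, and the cost normalization are all choices made here for illustration.

```python
import numpy as np


def sinkhorn_plan(cost, reg=0.05, n_iters=200):
    """Entropy-regularized OT plan between two uniform empirical measures.

    Assumption: Sinkhorn iterations are used here for simplicity; the
    abstract does not say which OT solver the method relies on.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass on agent states
    b = np.full(m, 1.0 / m)  # uniform mass on expert states
    K = np.exp(-cost / reg)  # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]


def ot_rewards(agent_emb, expert_emb, reg=0.05):
    """Per-state intrinsic rewards for one agent trajectory.

    agent_emb:  (n, d) array of state representations (hypothetical input;
                in the paper these would be the learned intents).
    expert_emb: (m, d) array of expert state representations.
    Returns an (n,) array; each entry is the negative transport cost
    carried by the corresponding agent state.
    """
    diff = agent_emb[:, None, :] - expert_emb[None, :, :]
    cost = np.sum(diff ** 2, axis=-1)  # squared Euclidean cost (assumed)
    cost = cost / (cost.max() + 1e-8)  # rescale to [0, 1] for stability
    plan = sinkhorn_plan(cost, reg=reg)
    return -np.sum(plan * cost, axis=1)
```

Under these assumptions, an agent trajectory whose representations lie close to the expert's receives rewards nearer to zero, while a distant trajectory receives more negative ones. The resulting rewards could then label an unlabeled offline dataset before running a standard offline RL algorithm on it.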