Preference-based Reinforcement Learning (PbRL) has demonstrated remarkable efficacy in aligning rewards with human intentions. However, a significant challenge is its need for a large number of human labels, which are costly and time-consuming to collect. Moreover, the expensive preference data obtained for prior tasks is typically not reusable for subsequent tasks, so each new task requires extensive labeling. In this paper, we propose a novel zero-shot preference-based RL algorithm that leverages labeled preference data from source tasks to infer labels for target tasks, eliminating the need for human queries. Our approach uses the Gromov-Wasserstein distance to align the trajectory distributions of the source and target tasks. The resulting optimal transport matrix serves as a correspondence between the trajectories of the two tasks, making it possible to identify corresponding trajectory pairs across tasks and transfer the preference labels. However, learning directly from inferred labels, a fraction of which are noisy, yields an inaccurate reward function and in turn degrades policy performance. To this end, we introduce Robust Preference Transformer, which models rewards as Gaussian distributions and incorporates reward uncertainty in addition to the reward mean. Empirical results on robotic manipulation tasks from Meta-World and Robomimic show that our method transfers preferences between tasks effectively and learns reward functions robustly from noisy labels. Furthermore, we show that our method attains near-oracle performance with only a small proportion of scripted labels.
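The alignment and label-transfer step described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. It assumes each trajectory is flattened into a feature vector, that intra-task structure is measured by Euclidean distance, that the transport matrix is approximated by an entropic Gromov-Wasserstein solver (Sinkhorn inner loop, per-iteration cost normalization as a stability heuristic), and that labels are transferred by hard argmax matching. The names `pairwise_dist`, `sinkhorn`, `entropic_gw`, `transfer_labels`, and `src_pref` are hypothetical.

```python
import numpy as np


def pairwise_dist(X):
    """Euclidean distance matrix between rows of X (one flattened trajectory per row)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))


def sinkhorn(cost, p, q, eps, n_iter=200):
    """Entropy-regularized OT plan for a fixed cost matrix and marginals p, q."""
    K = np.exp(-cost / eps)
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)
        u = p / (K @ v)  # last update makes the row marginals equal p
    return u[:, None] * K * v[None, :]


def entropic_gw(C1, C2, p, q, eps=0.05, n_outer=50):
    """Approximate Gromov-Wasserstein coupling (square loss) between two
    intra-task distance matrices C1 (source) and C2 (target)."""
    T = np.outer(p, q)  # start from the independent coupling
    # Terms of sum_{j,l} (C1[i,j] - C2[k,l])^2 T[j,l] that do not depend on T.
    const = ((C1 ** 2) @ p)[:, None] + ((C2 ** 2) @ q)[None, :]
    for _ in range(n_outer):
        cost = const - 2.0 * C1 @ T @ C2.T
        # Assumption: rescale the cost to [0, 1] so exp(-cost / eps) stays well-conditioned.
        cost = cost - cost.min()
        cost = cost / max(cost.max(), 1e-12)
        T = sinkhorn(cost, p, q, eps)
    return T


def transfer_labels(T, src_pref, tgt_pairs):
    """Infer target labels: match each target trajectory (column of T) to the
    source trajectory carrying the most transport mass, then look up the
    source preference for the matched pair."""
    match = T.argmax(axis=0)
    return np.array([src_pref[match[k], match[l]] for k, l in tgt_pairs])


# Toy usage: 6 source and 5 target trajectories with different feature sizes.
rng = np.random.default_rng(0)
X_src, X_tgt = rng.normal(size=(6, 4)), rng.normal(size=(5, 3))
p, q = np.full(6, 1 / 6), np.full(5, 1 / 5)
T = entropic_gw(pairwise_dist(X_src), pairwise_dist(X_tgt), p, q)

# Hypothetical source labels: src_pref[i, j] = 1 if source trajectory i is preferred over j.
src_returns = np.arange(6.0)
src_pref = (src_returns[:, None] > src_returns[None, :]).astype(int)
inferred = transfer_labels(T, src_pref, [(0, 1), (2, 3)])
```

Because the Gromov-Wasserstein objective compares only within-task distance matrices, the source and target trajectories need not share a state or action dimension, which is why the toy data uses different feature sizes. Hard argmax matching is one simple choice; labels inferred this way can be wrong whenever the coupling is diffuse, which is the kind of label noise the robust reward model is meant to tolerate.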