Preference-based reinforcement learning (PbRL) enables RL agents to learn from preferences, which is particularly useful when formulating a reward function is challenging. Existing PbRL methods generally follow a two-step procedure: they first learn a reward model from the given preference data and then run an off-the-shelf reinforcement learning algorithm on the learned reward model. However, obtaining an accurate reward model solely from preference information can be difficult, especially when the preferences come from human teachers. Instead, we propose a PbRL algorithm that learns directly from preferences without requiring any reward modeling. To achieve this, we adopt a contrastive learning framework to design a novel policy scoring metric that assigns a high score to policies that align with the given preferences. We apply our algorithm to offline RL tasks with real human preference labels and show that it outperforms or is on par with existing PbRL methods. Notably, on high-dimensional control tasks, our algorithm surpasses offline RL methods that learn with ground-truth reward information. Finally, we show that our algorithm can be successfully applied to fine-tune large language models.
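To make the idea of a contrastive policy score concrete, the following is a minimal sketch and not the method's actual definition, which the abstract does not spell out. It assumes a policy is scored on a trajectory segment by the negative mean squared distance between its actions and the recorded actions, and that one preference pair is turned into a softmax-style contrastive loss. The names `segment_score`, `contrastive_preference_loss`, and `temperature` are hypothetical and introduced only for illustration.

```python
import math

def segment_score(policy, segment):
    """Hypothetical alignment score (an assumption, not the paper's metric):
    negative mean squared distance between the policy's actions and the
    actions recorded in the segment. Higher means closer agreement."""
    total = 0.0
    for state, action in segment:
        predicted = policy(state)
        total += sum((p - a) ** 2 for p, a in zip(predicted, action))
    return -total / len(segment)

def contrastive_preference_loss(policy, preferred, rejected, temperature=1.0):
    """Softmax-style contrastive loss over one preference pair.
    It is small when the policy scores the preferred segment higher."""
    s_pos = segment_score(policy, preferred) / temperature
    s_neg = segment_score(policy, rejected) / temperature
    # -log(exp(s_pos) / (exp(s_pos) + exp(s_neg))), computed stably
    m = max(s_pos, s_neg)
    log_norm = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_norm - s_pos

# Toy data: each segment is a list of (state, action) pairs with 1-D actions.
preferred = [((0.0,), (1.0,)), ((1.0,), (1.0,))]
rejected = [((0.0,), (-1.0,)), ((1.0,), (-1.0,))]

aligned = lambda state: (1.0,)      # imitates the preferred segment
misaligned = lambda state: (-1.0,)  # imitates the rejected segment

loss_aligned = contrastive_preference_loss(aligned, preferred, rejected)
loss_misaligned = contrastive_preference_loss(misaligned, preferred, rejected)
```

In this toy setting the policy that matches the preferred segment receives the lower loss, so minimizing the loss over a parameterized policy would push it toward the preferred behavior without any intermediate reward model.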