The complexity of designing reward functions has been a major obstacle to the wide application of deep reinforcement learning (RL) techniques. Describing an agent's desired behaviors and properties can be difficult, even for experts. Reinforcement learning from human preferences (or preference-based RL) has emerged as a promising solution: reward functions are learned from human preference labels over behavior trajectories. However, existing preference-based RL methods depend on accurate oracle preference labels. This paper addresses this limitation by developing a method for crowd-sourcing preference labels and learning from diverse human preferences. The key idea is to stabilize reward learning through regularization and correction in a latent space. To ensure temporal consistency, a strong constraint is imposed on the reward model that forces its latent space to be close to the prior distribution. Additionally, a confidence-based reward model ensembling method is designed to generate more stable and reliable predictions. The proposed method is evaluated on a variety of tasks in DMcontrol and Meta-world and shows consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback, paving the way for real-world applications of RL methods.
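The three ingredients named in the abstract (reward learning from preference labels, a latent space pulled toward a prior, and confidence-based ensembling) can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes the Bradley-Terry preference model commonly used in preference-based RL, a diagonal Gaussian latent with a standard normal prior, and inverse-variance weighting as one plausible notion of "confidence". All function names are hypothetical.

```python
import math


def preference_prob(rewards_a, rewards_b):
    """Bradley-Terry probability that segment A is preferred over segment B.

    Each argument is a list of per-step predicted rewards for one segment.
    """
    diff = sum(rewards_b) - sum(rewards_a)
    # P(A > B) = 1 / (1 + exp(diff)), computed in a numerically stable way.
    if diff >= 0:
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))


def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy loss; label is 1.0 if A is preferred, 0.0 if B is."""
    p = preference_prob(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for a diagonal Gaussian.

    Assumed form of the latent-space constraint: the penalty grows as the
    latent distribution drifts away from the prior.
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def confidence_weighted_reward(means, variances):
    """Combine ensemble members, weighting each by inverse predicted variance.

    Assumption: a member that reports lower variance is treated as more confident.
    """
    weights = [1.0 / (v + 1e-8) for v in variances]
    total = sum(weights)
    return sum(w * m for w, m in zip(weights, means)) / total


def regularized_loss(rewards_a, rewards_b, label, mu, log_var, beta=1.0):
    """Preference loss plus a KL penalty; beta (hypothetical) sets its strength."""
    return preference_loss(rewards_a, rewards_b, label) + beta * kl_to_standard_normal(mu, log_var)
```

In this sketch a larger `beta` corresponds to a stronger constraint on the latent space. How the paper actually sets the constraint strength and defines confidence is not stated in the abstract.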