The complexity of designing reward functions has been a major obstacle to the wide application of deep reinforcement learning (RL) techniques. Describing an agent's desired behaviors and properties can be difficult, even for experts. Reinforcement learning from human preferences (or preference-based RL) has emerged as a promising alternative, in which the reward function is learned from human preference labels over behavior trajectories. However, existing preference-based RL methods are limited by their need for accurate oracle preference labels. This paper addresses that limitation by developing a method for crowd-sourcing preference labels and learning from diverse human preferences. The key idea is to stabilize reward learning through regularization and correction in a latent space. To ensure temporal consistency, a strong constraint is imposed on the reward model that forces its latent space to be close to the prior distribution. In addition, a confidence-based reward model ensembling method is designed to generate more stable and reliable predictions. The proposed method is evaluated on a variety of tasks in DMcontrol and Meta-world, where it shows consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback, paving the way for real-world applications of RL methods.
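The three ingredients named in the abstract (reward learning from preference labels, a constraint pulling the latent space toward a prior, and confidence-based ensembling) can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: a Bradley-Terry cross-entropy as the preference loss, a standard normal prior with a KL penalty weighted by a hypothetical coefficient `beta`, and inverse-variance weighting as one plausible reading of "confidence-based" ensembling. All function and parameter names are hypothetical.

```python
import math


def preference_loss(return_a, return_b, label):
    """Bradley-Terry cross-entropy for one labeled pair of trajectory segments.

    return_a, return_b: summed predicted rewards of segments A and B.
    label: 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie.
    """
    diff = return_a - return_b
    # log sigmoid(diff), written in a numerically stable form
    if diff >= 0:
        log_p_a = -math.log1p(math.exp(-diff))
    else:
        log_p_a = diff - math.log1p(math.exp(diff))
    # log(1 - sigmoid(diff)) = log sigmoid(diff) - diff
    log_p_b = log_p_a - diff
    return -(label * log_p_a + (1.0 - label) * log_p_b)


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions.

    Assumption: the prior is a standard normal; the abstract only says "the prior".
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def regularized_loss(return_a, return_b, label, mu, log_var, beta=1.0):
    """Preference loss plus a KL penalty; beta is a hypothetical weight."""
    return preference_loss(return_a, return_b, label) + beta * kl_to_standard_normal(mu, log_var)


def confidence_weighted_reward(means, variances, eps=1e-8):
    """Combine ensemble members' reward predictions by inverse-variance weighting.

    Assumption: a member's confidence is the inverse of its predicted variance.
    """
    weights = [1.0 / (v + eps) for v in variances]
    total = sum(weights)
    return sum(w * m for w, m in zip(weights, means)) / total
```

In this sketch, a larger `beta` pulls the latent distribution harder toward the prior, which corresponds to the "strong constraint" described above, and an ensemble member that reports high variance contributes little to the combined reward. For example, `confidence_weighted_reward([1.0, 3.0], [0.1, 10.0])` lies much closer to the first member's prediction than to the second's.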