Preference-based reinforcement learning (RL) provides a framework for training agents from human feedback given as pairwise preferences over behaviors, enabling agents to learn desired behaviors when a numerical reward function is difficult to specify. Although this paradigm leverages human feedback, it currently treats the feedback as coming from a single human user. Robustly incorporating preference feedback from crowds (i.e., ensembles of users) remains a challenge, and the problem of training RL agents with feedback from multiple human users remains understudied. In this work, we introduce Crowd-PrefRL, a framework for performing preference-based RL using feedback from crowds. This work demonstrates the viability of learning reward functions from preference feedback provided by crowds of unknown expertise and reliability. Crowd-PrefRL not only robustly aggregates the crowd preference feedback, but also estimates the reliability of each user in the crowd using only the (noisy) crowdsourced preference comparisons. Most importantly, we show that agents trained with Crowd-PrefRL outperform agents trained with majority-vote preferences or with preferences from any individual user in most cases, especially when the spread of user error rates across the crowd is large. Results further suggest that our method can identify minority viewpoints within the crowd.
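To make the contrast between majority voting and reliability-aware aggregation concrete, the following is a minimal sketch on hand-built toy data. It is not the Crowd-PrefRL algorithm, which the abstract does not specify; it uses a simple stand-in technique (iterative agreement-weighted voting with log-odds weights), and all names (`majority_vote`, `reliability_weighted_vote`, `with_errors`) as well as the toy labels are assumptions made for illustration.

```python
import math

def majority_vote(labels):
    """labels[u][q] is user u's preference on query q: +1 or -1."""
    n_queries = len(labels[0])
    return [1 if sum(user[q] for user in labels) >= 0 else -1
            for q in range(n_queries)]

def reliability_weighted_vote(labels, n_iters=5, eps=0.01):
    """Illustrative stand-in: alternate between estimating each user's
    agreement with the consensus and re-voting with log-odds weights."""
    consensus = majority_vote(labels)
    reliability = []
    for _ in range(n_iters):
        # Estimated reliability: fraction of queries matching the consensus.
        reliability = [
            sum(a == b for a, b in zip(user, consensus)) / len(consensus)
            for user in labels
        ]
        # Clip before taking log-odds so the weights stay finite.
        clipped = [min(max(r, eps), 1 - eps) for r in reliability]
        weights = [math.log(r / (1 - r)) for r in clipped]
        consensus = [
            1 if sum(w * user[q] for w, user in zip(weights, labels)) >= 0 else -1
            for q in range(len(consensus))
        ]
    return consensus, reliability

def with_errors(truth, wrong):
    """Hypothetical helper: copy of truth with the given query indices flipped."""
    return [-t if i in wrong else t for i, t in enumerate(truth)]

# Toy crowd (assumed data): one reliable user and two noisy users whose
# errors overlap on query 3, so a plain majority vote is wrong there.
truth = [1, -1, 1, 1, -1, -1, 1, -1, 1, -1]
labels = [
    with_errors(truth, set()),
    with_errors(truth, {0, 1, 2, 3}),
    with_errors(truth, {3, 4, 5, 6}),
]

mv = majority_vote(labels)
consensus, reliability = reliability_weighted_vote(labels)
print("majority-vote matches:", sum(a == b for a, b in zip(mv, truth)))
print("weighted-vote matches:", sum(a == b for a, b in zip(consensus, truth)))
print("estimated reliability:", reliability)
```

In this toy case the reliable user's weight exceeds the combined weight of the two noisy users, so the weighted vote recovers the label that the majority vote gets wrong. The ground-truth labels are used only to score the two aggregates, not to estimate the weights.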