Preference-based reinforcement learning (PbRL) provides a natural way to align RL agents' behavior with human-desired outcomes, but is often constrained by costly human feedback. To improve feedback efficiency, most existing PbRL methods focus on selecting queries that maximally improve the overall quality of the reward model, but counter-intuitively, we find that this may not necessarily lead to improved performance. To unravel this mystery, we identify a long-neglected issue in the query selection schemes of existing PbRL studies: Query-Policy Misalignment. We show that the seemingly informative queries selected to improve the overall quality of the reward model may not actually align with RL agents' interests, thus offering little help for policy learning and eventually resulting in poor feedback efficiency. We show that this issue can be effectively addressed via near on-policy query selection and a specially designed hybrid experience replay, which together enforce bidirectional query-policy alignment. Our method is simple yet elegant and can be easily incorporated into existing approaches by changing only a few lines of code. We show through comprehensive experiments that our method achieves substantial gains in both human feedback efficiency and RL sample efficiency, demonstrating the importance of addressing query-policy misalignment in PbRL tasks.
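The two ingredients named in the abstract, near on-policy query selection and hybrid experience replay, can be sketched roughly as follows. This is a minimal illustration under assumed data structures (a segment buffer for queries and a transition replay buffer for policy updates); the names `NearOnPolicyQuerySelector`, `recent_window`, `hybrid_batch`, and `on_policy_ratio` are hypothetical and are not taken from the paper's actual implementation.

```python
import random
from collections import deque

class NearOnPolicyQuerySelector:
    """Sketch: select preference queries from segments produced by the current policy."""

    def __init__(self, buffer_size=100_000, recent_window=5_000):
        self.segments = deque(maxlen=buffer_size)  # all collected trajectory segments
        self.recent_window = recent_window         # number of latest segments treated as "near on-policy"

    def add_segment(self, segment):
        self.segments.append(segment)

    def sample_query(self):
        """Return a pair of near on-policy segments to present to the human labeler."""
        recent = list(self.segments)[-self.recent_window:]
        return random.sample(recent, 2)


def hybrid_batch(recent_transitions, replay_buffer, batch_size=256, on_policy_ratio=0.5):
    """Sketch: mix the latest rollout data with uniformly replayed older data for policy updates."""
    n_recent = int(batch_size * on_policy_ratio)
    batch = random.sample(recent_transitions, min(n_recent, len(recent_transitions)))
    batch += random.sample(replay_buffer, batch_size - len(batch))
    return batch
```

The intended intuition, as described in the abstract, is bidirectional alignment: queries are drawn from behavior the current policy actually produces, and policy updates emphasize data whose reward labels the freshly updated reward model covers well.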