Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences. However, gathering high-quality human preference labels is time-consuming and expensive. RL from AI Feedback (RLAIF), introduced by Bai et al., offers a promising alternative that uses a powerful off-the-shelf LLM to generate preferences in lieu of human annotators. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, RLAIF achieves comparable or superior performance to RLHF, as rated by human evaluators. Furthermore, RLAIF outperforms a supervised fine-tuned baseline even when the LLM preference labeler is the same size as the policy. In another experiment, directly prompting the LLM for reward scores achieves superior performance to the canonical RLAIF setup, where LLM preference labels are first distilled into a reward model. Finally, we conduct extensive studies on techniques for generating aligned AI preferences. Our results suggest that RLAIF can achieve performance on par with using human feedback, offering a potential solution to the scalability limitations of RLHF.
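The two reward setups contrasted in the abstract (distilling LLM preference labels into a reward model versus prompting the LLM for a score directly) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the abstract: the function names, the use of labeler log-probabilities to form soft labels, the pairwise logistic loss, and the 1 to 10 score range rescaled to [-1, 1].

```python
import math
from typing import Dict, Tuple


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def soft_preference(logp_first: float, logp_second: float) -> Tuple[float, float]:
    """Turn the labeler's log-probabilities for 'first is better' and
    'second is better' into a soft preference label via a softmax.
    (Assumed labeling scheme, for illustration only.)"""
    m = max(logp_first, logp_second)
    e1 = math.exp(logp_first - m)
    e2 = math.exp(logp_second - m)
    z = e1 + e2
    return e1 / z, e2 / z


def reward_model_loss(r_first: float, r_second: float, p_first: float) -> float:
    """Distilled setup: cross-entropy between the soft AI label and the
    pairwise probability sigmoid(r_first - r_second) implied by the
    reward model's scalar scores."""
    d = r_first - r_second
    # -log(sigmoid(d)) == softplus(-d); -log(1 - sigmoid(d)) == softplus(d)
    return p_first * _softplus(-d) + (1.0 - p_first) * _softplus(d)


def direct_reward(score_logps: Dict[int, float], lo: int = 1, hi: int = 10) -> float:
    """Direct setup: expected score under the labeler's distribution over
    score tokens, rescaled linearly from [lo, hi] to [-1, 1].
    (The score range and rescaling are assumptions.)"""
    m = max(score_logps.values())
    weights = {s: math.exp(lp - m) for s, lp in score_logps.items()}
    z = sum(weights.values())
    expected = sum(s * w for s, w in weights.items()) / z
    return 2.0 * (expected - lo) / (hi - lo) - 1.0
```

In the distilled setup, `reward_model_loss` would be minimized over many labeled pairs before RL begins, and the trained reward model then scores policy samples. In the direct setup, `direct_reward` would be called on the labeler's output for each sample during RL, with no separate reward model to train.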