Reinforcement learning from human feedback (RLHF) is effective at aligning large language models (LLMs) to human preferences, but gathering high-quality human preference labels is a key bottleneck. We conduct a head-to-head comparison of RLHF and RL from AI Feedback (RLAIF), a technique in which preferences are labeled by an off-the-shelf LLM in lieu of humans, and find that the two result in similar improvements. On the task of summarization, human evaluators prefer generations from both RLAIF and RLHF over a baseline supervised fine-tuned model in roughly 70% of cases. Furthermore, when asked to rate RLAIF vs. RLHF summaries, humans prefer both at equal rates. These results suggest that RLAIF can yield human-level performance, offering a potential solution to the scalability limitations of RLHF.
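To make the two quantities in the abstract concrete, the following is a minimal sketch of (a) attaching preference labels to pairs of candidate summaries and (b) computing a win rate from pairwise judgments. It is an illustration under stated assumptions, not the procedure used in this work: the names `Judge`, `label_preferences`, `win_rate`, and `toy_judge` are hypothetical, and `toy_judge` is a placeholder heuristic standing in for a call to an off-the-shelf LLM.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a judge receives (text, summary_a, summary_b) and
# returns 0 if it prefers summary_a, or 1 if it prefers summary_b.
Judge = Callable[[str, str, str], int]


def label_preferences(
    examples: List[Tuple[str, str, str]], judge: Judge
) -> List[Tuple[str, str, str, int]]:
    """Attach a preference label to each (text, summary_a, summary_b) triple."""
    return [(text, a, b, judge(text, a, b)) for text, a, b in examples]


def win_rate(preferred: List[int], candidate: int = 1) -> float:
    """Fraction of pairwise comparisons in which `candidate` was preferred."""
    if not preferred:
        raise ValueError("need at least one comparison")
    return sum(1 for p in preferred if p == candidate) / len(preferred)


def toy_judge(text: str, a: str, b: str) -> int:
    """Placeholder judge that prefers the shorter summary.

    This is NOT an LLM call; in an RLAIF-style setup it would be replaced
    by a function that queries a language model for its preference.
    """
    return 0 if len(a) <= len(b) else 1


if __name__ == "__main__":
    examples = [
        ("long article one", "short", "a much longer summary"),
        ("long article two", "a rather wordy summary", "brief"),
    ]
    labeled = label_preferences(examples, toy_judge)
    labels = [row[3] for row in labeled]
    print(labels)                         # [0, 1]
    print(win_rate(labels, candidate=1))  # 0.5
```

Under this reading, a statement such as "preferred in roughly 70% of cases" corresponds to `win_rate` returning about 0.7 when the labels come from human evaluators comparing a policy's summary against the baseline's.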