Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning (RL) that learns from human feedback instead of relying on an engineered reward function. Building on prior work on the related setting of preference-based reinforcement learning (PbRL), it stands at the intersection of artificial intelligence and human-computer interaction. This positioning offers a promising avenue to enhance the performance and adaptability of intelligent systems while also improving the alignment of their objectives with human values. The training of large language models (LLMs) has demonstrated this potential in recent years: there, RLHF played a decisive role in directing the model's capabilities toward human objectives. This article provides an overview of the fundamentals of RLHF, exploring the interaction between RL agents and human input. While recent focus has been on RLHF for LLMs, our survey adopts a broader perspective, examining the diverse applications and wide-ranging impact of the technique. We delve into the core principles that underpin RLHF, shedding light on the relationship between algorithms and human feedback, and discuss the main research trends in the field. By synthesizing the current landscape of RLHF research, this article aims to give researchers and practitioners a comprehensive understanding of this rapidly growing field.