Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning (RL) that learns from human feedback instead of relying on an engineered reward function. Building on prior work on the related setting of preference-based reinforcement learning (PbRL), it stands at the intersection of artificial intelligence and human-computer interaction. This positioning offers a promising avenue for enhancing the performance and adaptability of intelligent systems while also improving the alignment of their objectives with human values. The training of large language models (LLMs) has impressively demonstrated this potential in recent years: there, RLHF played a decisive role in directing the models' capabilities toward human objectives. This article provides a comprehensive overview of the fundamentals of RLHF, exploring the intricate dynamics between machine agents and human input. While recent attention has focused on RLHF for LLMs, our survey adopts a broader perspective, examining the diverse applications and wide-ranging impact of the technique. We delve into the core principles that underpin RLHF, shedding light on the symbiotic relationship between algorithms and human feedback, and discuss the main research trends in the field. By synthesizing the current landscape of RLHF research, this article aims to provide researchers and practitioners alike with a comprehensive understanding of this rapidly growing field.