Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to fine-tune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) give an overview of techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-faceted approach to the development of safer AI systems.