This paper surveys research in the rapidly growing field of enhancing large language models (LLMs) with reinforcement learning (RL), a technique in which LLMs improve their performance by receiving reward feedback based on the quality of their outputs, enabling them to generate more accurate, coherent, and contextually appropriate responses. In this work, we provide a systematic review of the current state of knowledge on RL-enhanced LLMs, consolidating and analyzing the fast-moving research in this field to help researchers understand current challenges and advancements. Specifically, we (1) detail the basics of RL; (2) introduce popular RL-enhanced LLMs; (3) review research on two widely used reward model-based RL techniques: Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF); and (4) explore Direct Preference Optimization (DPO), a set of methods that bypass the reward model and directly use human preference data to align LLM outputs with human expectations. We also point out the current challenges and deficiencies of existing methods and suggest avenues for further improvement. The project page for this work can be found at: \url{https://github.com/ShuheWang1998/Reinforcement-Learning-Enhanced-LLMs-A-Survey}.
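As a concrete anchor for the contrast between reward model-based RLHF and reward model-free alignment, the DPO objective introduced by Rafailov et al. (2023) optimizes the policy $\pi_\theta$ directly on preference pairs, with no separately trained reward model:
\begin{equation*}
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right],
\end{equation*}
where $y_w$ and $y_l$ are the preferred and dispreferred responses to prompt $x$, $\pi_{\text{ref}}$ is a frozen reference policy (typically the supervised fine-tuned model), $\beta$ controls the strength of the implicit KL constraint, and $\sigma$ is the logistic function.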