Reinforcement learning from human feedback (RLHF) can improve the quality of large language model (LLM) outputs by aligning them with human preferences. We propose a simple algorithm for aligning LLMs with human preferences, inspired by growing-batch reinforcement learning (RL), which we call Reinforced Self-Training (ReST). Given an initial LLM policy, ReST produces a dataset by generating samples from the policy; these samples are then used to improve the policy with offline RL algorithms. ReST is more efficient than typical online RLHF methods because the training dataset is produced offline, which allows data reuse. While ReST is a general approach applicable to all generative learning settings, we focus on its application to machine translation. Our results show that ReST can substantially improve translation quality, as measured by automated metrics and human evaluation on machine translation benchmarks, in a compute- and sample-efficient manner.
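The loop described in the abstract (sample a dataset from the current policy, then improve the policy offline on that dataset) can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the abstract: the policy is a plain categorical distribution over three hypothetical outputs, the reward is a fixed lookup table, and reward-filtered maximum likelihood with a hand-picked threshold schedule stands in for "offline RL algorithms". The names `sample`, `improve`, and `rest_sketch` are likewise hypothetical.

```python
import random
from collections import Counter


def sample(policy, n, rng):
    """Draw n outputs from a categorical policy given as {output: probability}."""
    outputs = list(policy)
    weights = [policy[o] for o in outputs]
    return rng.choices(outputs, weights=weights, k=n)


def improve(policy, dataset, reward, threshold, smoothing=1e-3):
    """Refit the policy on the samples whose reward clears the threshold.

    Assumption: reward-filtered maximum likelihood is used here as a simple
    stand-in for an offline RL update.
    """
    kept = [y for y in dataset if reward(y) >= threshold]
    if not kept:
        return dict(policy)  # nothing passed the filter; keep the policy as is
    counts = Counter(kept)
    total = len(kept) + smoothing * len(policy)
    return {y: (counts[y] + smoothing) / total for y in policy}


def rest_sketch(policy, reward, n_samples=2000, thresholds=(0.3, 0.6), seed=0):
    """Generate a dataset once, then reuse it for several offline updates."""
    rng = random.Random(seed)
    dataset = sample(policy, n_samples, rng)  # produced offline, a single time
    for threshold in thresholds:              # the same dataset is reused
        policy = improve(policy, dataset, reward, threshold)
    return policy


def expected_reward(policy, reward):
    """Average reward of the policy's outputs under its own distribution."""
    return sum(p * reward(y) for y, p in policy.items())


# Toy setup (hypothetical): three candidate outputs with fixed rewards.
REWARDS = {"bad": 0.1, "okay": 0.5, "good": 0.9}
initial_policy = {y: 1 / len(REWARDS) for y in REWARDS}
final_policy = rest_sketch(initial_policy, REWARDS.get)
```

In this sketch the dataset is sampled once and every call to `improve` reads from it, which mirrors the data-reuse argument in the abstract. An online method would instead need fresh samples from the policy for each update.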