We tackle the problem of aligning pre-trained large language models (LMs) with human preferences. If we view text generation as a sequential decision-making problem, reinforcement learning (RL) appears to be a natural conceptual framework. However, using RL for LM-based generation faces empirical challenges, including training instability due to the combinatorial action space, as well as a lack of open-source libraries and benchmarks customized for LM alignment. Thus, a question arises in the research community: is RL a practical paradigm for NLP? To help answer this, we first introduce an open-source modular library, RL4LMs (Reinforcement Learning for Language Models), for optimizing language generators with RL. The library consists of on-policy RL algorithms that can be used to train any encoder or encoder-decoder LM in the HuggingFace library (Wolf et al. 2020) with an arbitrary reward function. Next, we present the GRUE (General Reinforced-language Understanding Evaluation) benchmark, a set of 6 language generation tasks that are supervised not by target strings, but by reward functions that capture automated measures of human preference. GRUE is the first leaderboard-style evaluation of RL algorithms for NLP tasks. Finally, we introduce an easy-to-use, performant RL algorithm, NLPO (Natural Language Policy Optimization), that learns to effectively reduce the combinatorial action space in language generation. We show 1) that RL techniques are generally better than supervised methods at aligning LMs to human preferences; and 2) that NLPO exhibits greater stability and performance than previous policy gradient methods (e.g., PPO (Schulman et al. 2017)), based on both automatic and human evaluations.