Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization

We tackle the problem of aligning pre-trained large language models (LMs) with human preferences. If we view text generation as a sequential decision-making problem, reinforcement learning (RL) appears to be a natural conceptual framework. However, using RL for LM-based generation faces empirical challenges, including training instability due to the combinatorial action space, as well as a lack of open-source libraries and benchmarks customized for LM alignment. Thus, a question arises in the research community: is RL a practical paradigm for NLP? To help answer this, we first introduce an open-source modular library, RL4LMs (Reinforcement Learning for Language Models), for optimizing language generators with RL. The library consists of on-policy RL algorithms that can be used to train any encoder or encoder-decoder LM in the HuggingFace library (Wolf et al. 2020) with an arbitrary reward function. Next, we present the GRUE (General Reinforced-language Understanding Evaluation) benchmark, a set of 6 language generation tasks which are supervised not by target strings, but by reward functions which capture automated measures of human preference. GRUE is the first leaderboard-style evaluation of RL algorithms for NLP tasks. Finally, we introduce an easy-to-use, performant RL algorithm, NLPO (Natural Language Policy Optimization), that learns to effectively reduce the combinatorial action space in language generation. We show 1) that RL techniques are generally better than supervised methods at aligning LMs to human preferences; and 2) that NLPO exhibits greater stability and performance than previous policy gradient methods (e.g., PPO (Schulman et al. 2017)), based on both automatic and human evaluations.
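To make the sequential decision-making view concrete, here is a minimal sketch in plain Python. It is an illustration only and does not use the RL4LMs API: the vocabulary `VOCAB`, the functions `rollout`, `uniform_policy`, and `count_good`, and all parameters are hypothetical names chosen for this example. The state is the prompt plus the tokens generated so far, each action is one token from the vocabulary, and a single reward is computed on the finished sequence by an arbitrary scoring function. With a vocabulary of size |V| and T steps there are up to |V|^T possible sequences, which is the combinatorial action space the abstract refers to.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical toy vocabulary; a real LM has tens of thousands of tokens.
VOCAB = ["good", "bad", "movie", "<eos>"]


def rollout(
    policy: Callable[[Sequence[str]], List[float]],
    reward_fn: Callable[[Sequence[str]], float],
    prompt: Sequence[str],
    max_steps: int = 5,
    seed: int = 0,
) -> Tuple[List[str], float]:
    """Sample one episode in which every step chooses one token (the action)."""
    rng = random.Random(seed)
    state = list(prompt)  # state = prompt plus the tokens generated so far
    actions: List[str] = []
    for _ in range(max_steps):
        probs = policy(state)  # distribution over the whole vocabulary
        token = rng.choices(VOCAB, weights=probs, k=1)[0]
        actions.append(token)
        state.append(token)
        if token == "<eos>":  # the episode ends at end-of-sequence
            break
    # Sequence-level reward from an arbitrary scoring function,
    # standing in for an automated measure of human preference.
    return actions, reward_fn(actions)


def uniform_policy(state: Sequence[str]) -> List[float]:
    """Placeholder policy: ignores the state and weights all tokens equally."""
    return [1.0 / len(VOCAB)] * len(VOCAB)


def count_good(tokens: Sequence[str]) -> float:
    """Placeholder reward: number of times the token 'good' was generated."""
    return float(sum(t == "good" for t in tokens))


actions, reward = rollout(uniform_policy, count_good, prompt=["the"])
print(actions, reward)
```

In an actual training loop, the placeholder policy would be a pre-trained LM's next-token distribution and an on-policy algorithm such as PPO would update it from the sampled rewards; the sketch covers only the environment side of that loop.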
