Improving Language Models with Advantage-based Offline Policy Gradients

Abstract: Language Models (LMs) achieve substantial language capabilities when finetuned using Reinforcement Learning with Human Feedback (RLHF). However, RLHF is an unstable and data-hungry process that continually requires new high-quality LM-generated data for finetuning. We introduce Advantage-Leftover Lunch RL (A-LoL), a new class of offline policy gradient algorithms that enable RL training on any pre-existing data. By treating the entire LM output sequence as a single action, A-LoL can incorporate sequence-level classifiers or human-designed scoring functions as rewards. Then, using the LM's internal sequence-level value estimate, A-LoL filters out negative-advantage (low-quality) data points during training, making it resilient to noise. Overall, A-LoL is an easy-to-implement LM training recipe that is sample-efficient and stable. We demonstrate the effectiveness of A-LoL and its variants on a set of four different language generation tasks. We compare against online RL (PPO) as well as recent preference-based (DPO, PRO) and reward-based (GOLD) offline RL baselines. On the commonly used RLHF benchmark, Helpful and Harmless Assistant (HHA), LMs trained with A-LoL methods achieve the highest diversity while also being rated safer and more helpful than baselines by human evaluators. Additionally, on the remaining three tasks, A-LoL could optimize multiple distinct reward functions even when using noisy or suboptimal training data. We also release our experimental code at https://github.com/abaheti95/LoL-RL
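The filtering idea described in the abstract can be illustrated with a minimal sketch. It assumes that, because the whole output is a single action, the advantage of an example is its sequence-level reward minus the LM's value estimate for the input. The `Example` class, its field names, and the advantage-weighted log-likelihood loss are hypothetical choices made for illustration; they are not taken from the released code and may differ from the exact A-LoL objective.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One offline training example (all field names are hypothetical)."""
    reward: float        # sequence-level reward, e.g. from a classifier
    value: float         # the LM's sequence-level value estimate for the input
    seq_log_prob: float  # log-probability of the whole output under the LM


def advantage(ex: Example) -> float:
    # Assumption: the whole output is one action, so advantage = reward - value.
    return ex.reward - ex.value


def filter_positive_advantage(data: List[Example]) -> List[Example]:
    # Drop negative-advantage (low-quality) examples before training.
    return [ex for ex in data if advantage(ex) > 0.0]


def advantage_weighted_loss(data: List[Example]) -> float:
    # Illustrative loss: advantage-weighted negative log-likelihood,
    # averaged over the examples that survive filtering.
    kept = filter_positive_advantage(data)
    if not kept:
        return 0.0
    return -sum(advantage(ex) * ex.seq_log_prob for ex in kept) / len(kept)


# Toy data: the second example has a negative advantage and is filtered out.
data = [
    Example(reward=1.0, value=0.5, seq_log_prob=-2.0),
    Example(reward=0.25, value=0.75, seq_log_prob=-1.0),
    Example(reward=0.75, value=0.25, seq_log_prob=-4.0),
]
kept = filter_positive_advantage(data)
loss = advantage_weighted_loss(data)
```

In an actual training loop the log-probabilities would come from the LM being finetuned and the loss would be differentiated with respect to its parameters; plain floats are used here only to keep the sketch self-contained.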
