Improving Language Models with Advantage-based Offline Policy Gradients

Language Models (LMs) achieve substantial language capabilities when finetuned using Reinforcement Learning with Human Feedback (RLHF). However, RLHF is an unstable and data-hungry process that continually requires new high-quality LM-generated data for finetuning. We introduce Advantage-Leftover Lunch RL (A-LoL), a new class of offline policy gradient algorithms that enable RL training on any pre-existing data. By treating the entire LM output sequence as a single action, A-LoL can incorporate sequence-level classifiers or human-designed scoring functions as rewards. Then, using the LM's internal sequence-level value estimate, A-LoL filters out negative-advantage (low-quality) data points during training, making it resilient to noise. Overall, A-LoL is an easy-to-implement LM training recipe that is sample-efficient and stable. We demonstrate the effectiveness of A-LoL and its variants on a set of four different language generation tasks. We compare against online RL (PPO) as well as recent preference-based (DPO, PRO) and reward-based (GOLD) offline RL baselines. On the commonly used RLHF benchmark, Helpful and Harmless Assistant (HHA), LMs trained with A-LoL methods achieve the highest diversity while also being rated safer and more helpful than baselines by human evaluators. Additionally, on the remaining three tasks, A-LoL could optimize multiple distinct reward functions even when using noisy or suboptimal training data. We also release our experimental code at https://github.com/abaheti95/LoL-RL
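
The filtering step described above can be illustrated with a minimal sketch. It is not the released implementation: it assumes the advantage of an example is its sequence-level reward minus the LM's value estimate for the prompt, and it uses a simple advantage-weighted log-likelihood as a stand-in objective. The `Example` class, its fields, and the function names are hypothetical, and the full method may include further terms that the abstract does not describe.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One offline training instance; the whole output counts as a single action."""
    prompt: str
    output: str
    reward: float    # sequence-level reward, e.g. a classifier score (assumed given)
    value: float     # LM's value estimate for the prompt (assumed given)
    log_prob: float  # log-probability of the full output under the LM being trained


def advantage(ex: Example) -> float:
    # Assumption: advantage = reward - value estimate.
    return ex.reward - ex.value


def filter_positive_advantage(data: List[Example]) -> List[Example]:
    # Drop negative-advantage (low-quality) examples before training.
    return [ex for ex in data if advantage(ex) > 0]


def advantage_weighted_loss(batch: List[Example]) -> float:
    # Stand-in objective: mean of -advantage * log_prob over the batch.
    # Minimizing it raises the likelihood of high-advantage outputs.
    if not batch:
        return 0.0
    return -sum(advantage(ex) * ex.log_prob for ex in batch) / len(batch)


if __name__ == "__main__":
    data = [
        Example("p1", "helpful reply", reward=1.0, value=0.5, log_prob=-2.0),
        Example("p2", "noisy reply", reward=0.25, value=0.5, log_prob=-1.0),
        Example("p3", "decent reply", reward=0.75, value=0.5, log_prob=-4.0),
    ]
    kept = filter_positive_advantage(data)
    print([ex.prompt for ex in kept])
    print(advantage_weighted_loss(kept))
```

In a real training loop `log_prob` would be a differentiable tensor produced by the LM rather than a stored float, and the filter would typically be applied once over the offline dataset before optimization begins.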