Language Models (LMs) achieve substantial language capabilities when finetuned using Reinforcement Learning with Human Feedback (RLHF). However, RLHF is an unstable and data-hungry process that continually requires new high-quality LM-generated data for finetuning. We introduce Advantage-Leftover Lunch RL (A-LoL), a new class of offline policy gradient algorithms that enable RL training on any pre-existing data. By treating the entire LM output sequence as a single action, A-LoL allows sequence-level classifiers or human-designed scoring functions to be used as rewards. Then, using the LM's internal sequence-level value estimate, A-LoL filters out negative-advantage (low-quality) data points during training, making it resilient to noise. Overall, A-LoL is an easy-to-implement LM training recipe that is sample-efficient and stable. We demonstrate the effectiveness of A-LoL and its variants on a set of four different language generation tasks. We compare against both online RL (PPO) and recent preference-based (DPO, PRO) and reward-based (GOLD) offline RL baselines. On the commonly used RLHF benchmark, Helpful and Harmless Assistant (HHA), LMs trained with A-LoL methods achieve the highest diversity while also being rated safer and more helpful than baselines by human evaluators. Additionally, on the remaining three tasks, A-LoL could optimize multiple distinct reward functions even when using noisy or suboptimal training data. We also release our experimental code at https://github.com/abaheti95/LoL-RL
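The filtering step described in the abstract can be illustrated with a minimal sketch. It is not taken from the released code: the names `Example`, `filter_positive_advantage`, `toy_reward`, and `toy_value` are hypothetical, the reward and value functions are toy stand-ins, and the advantage is assumed to follow the standard definition of reward minus value estimate, computed once per whole output sequence.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Example:
    prompt: str
    response: str  # the whole output sequence is treated as a single action


def filter_positive_advantage(
    data: List[Example],
    reward_fn: Callable[[str, str], float],
    value_fn: Callable[[str], float],
) -> List[Tuple[Example, float]]:
    """Keep only examples whose sequence-level advantage is positive.

    Assumption: advantage = reward(prompt, response) - value(prompt).
    """
    kept = []
    for ex in data:
        advantage = reward_fn(ex.prompt, ex.response) - value_fn(ex.prompt)
        if advantage > 0:  # negative-advantage points are dropped
            kept.append((ex, advantage))
    return kept


# Hypothetical stand-ins: a real setup would use a sequence-level classifier
# or scoring function for the reward and the LM's own value estimate.
def toy_reward(prompt: str, response: str) -> float:
    return 1.0 if response.endswith(".") else 0.0


def toy_value(prompt: str) -> float:
    return 0.5


data = [
    Example("Hi", "Hello there."),  # reward 1.0, advantage +0.5
    Example("Hi", "asdf"),          # reward 0.0, advantage -0.5
]
kept = filter_positive_advantage(data, toy_reward, toy_value)
```

In this toy run only the first example survives the filter. The retained advantage could then serve as a per-example weight in the training loss, although how that weighting is done is outside what the sketch shows.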