Learning from human feedback has been shown to be effective at aligning language models with human preferences. Past work has often relied on Reinforcement Learning from Human Feedback (RLHF), which optimizes the language model using reward scores assigned by a reward model trained on human preference data. In this work we show how the recently introduced Sequence Likelihood Calibration (SLiC) can also be used to effectively learn from human preferences (SLiC-HF). Furthermore, we demonstrate this can be done with human feedback data collected for a different model, similar to off-policy, offline RL data. Automatic and human evaluation experiments on the TL;DR summarization task show that SLiC-HF significantly improves over supervised fine-tuning baselines. Moreover, SLiC-HF presents a competitive alternative to the PPO RLHF implementation used in past work while being much simpler to implement, easier to tune, and more computationally efficient in practice.
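The abstract does not state the training objective. As a rough illustration of how a calibration-style loss can learn from pairwise preferences without a reinforcement learning loop, the sketch below assumes a hinge-style rank calibration term over the sequence log-likelihoods of a preferred and a dispreferred summary, plus a cross-entropy regularizer toward a reference (SFT) target. The function name `slic_hf_loss` and the parameters `delta` (margin) and `lam` (regularization weight) are hypothetical names chosen for this example, and the default values are arbitrary.

```python
def slic_hf_loss(logp_pos, logp_neg, logp_ref, delta=1.0, lam=0.5):
    """Sketch of a SLiC-style preference loss for a single example.

    logp_pos: sequence log-likelihood of the human-preferred output
    logp_neg: sequence log-likelihood of the dispreferred output
    logp_ref: sequence log-likelihood of a reference (SFT) target,
              used only by the regularizer
    delta:    margin by which logp_pos should exceed logp_neg (assumed)
    lam:      weight on the regularization term (assumed)
    """
    # Hinge term: zero once the preferred output leads by at least delta.
    calibration = max(0.0, delta - logp_pos + logp_neg)
    # Cross-entropy on the reference target keeps the model near the SFT model.
    regularization = -logp_ref
    return calibration + lam * regularization


# Preferred output already leads by more than the margin:
# only the regularizer contributes.
print(slic_hf_loss(-2.0, -5.0, -3.0))  # 1.5

# Dispreferred output is more likely: the hinge term is active.
print(slic_hf_loss(-5.0, -2.0, -3.0))  # 5.5
```

In a real setup the three log-likelihoods would come from the language model being trained and the loss would be averaged over a batch. Because the loss depends only on likelihoods of fixed sequences, the preference pairs can come from a different model, which is consistent with the offline, off-policy use of feedback data described above.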