Learning from human preferences is crucial for language models (LMs) to serve human needs and align with societal values. Previous research has made notable progress by leveraging human feedback to train LMs to follow instructions. However, these approaches rely primarily on online reinforcement learning (RL) techniques such as Proximal Policy Optimization (PPO), which have proven unstable and difficult to tune for language models. Moreover, PPO requires a complex distributed system implementation, which hinders the efficiency of large-scale distributed training. In this study, we propose an offline reinforcement learning from human feedback (RLHF) framework that aligns LMs using pre-generated samples, without interacting with RL environments. Specifically, we explore maximum likelihood estimation (MLE) with filtering, reward-weighted regression (RWR), and Decision Transformer (DT) to align language models with human preferences. Because they employ a loss function similar to that of supervised fine-tuning, our methods train more stably than PPO while needing only a simple machine learning system~(MLSys) and far fewer computing resources (around 12.3\%). Experimental results demonstrate that DT alignment outperforms the other offline RLHF methods and also surpasses PPO.
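As a rough illustration of how the three offline objectives named above can each be written as a variant of the supervised fine-tuning loss, the following minimal sketch operates on per-sample negative log-likelihoods (NLLs) and scalar rewards. It is not the paper's implementation: the function names, the reward threshold, the temperature `beta`, the batch-normalized exponential weighting, and the `<reward_k>` token format are all assumptions chosen for illustration.

```python
import math


def filtered_mle_loss(nlls, rewards, threshold):
    """MLE with filtering: mean NLL over samples whose reward clears a threshold.

    The threshold is a hypothetical hyperparameter, not a value from the paper.
    """
    kept = [nll for nll, r in zip(nlls, rewards) if r >= threshold]
    if not kept:
        raise ValueError("no sample passes the reward threshold")
    return sum(kept) / len(kept)


def rwr_loss(nlls, rewards, beta=1.0):
    """Reward-weighted regression: NLL weighted by exp(reward / beta).

    Weights are normalized over the batch (an assumption); subtracting the
    maximum reward keeps the exponentials numerically stable.
    """
    top = max(rewards)
    weights = [math.exp((r - top) / beta) for r in rewards]
    total = sum(weights)
    return sum(w / total * nll for w, nll in zip(weights, nlls))


def dt_prompt(prompt, reward, num_bins=5, low=0.0, high=1.0):
    """Decision-Transformer-style conditioning: prefix a discretized reward token.

    The model is then trained with the ordinary NLL on the response; at
    inference time the highest bin is requested. The token format is hypothetical.
    """
    frac = (reward - low) / (high - low)
    bin_index = min(num_bins - 1, max(0, int(frac * num_bins)))
    return f"<reward_{bin_index}> {prompt}"
```

In all three cases the quantity being minimized remains a (possibly reweighted or conditioned) NLL over fixed, pre-generated samples, which is why no online rollout or separate value network is needed in this sketch.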