Learning from human preferences is crucial for language models (LMs) to serve human needs and reflect societal values. Previous research has made notable progress by using human feedback to train models to follow instructions. However, these approaches rely primarily on online learning techniques such as Proximal Policy Optimization (PPO), which have proven unstable and difficult to tune for language models. Moreover, PPO requires a complex distributed system implementation, which hinders efficient large-scale distributed training. In this study, we propose an offline framework for learning from human feedback that aligns LMs without interacting with environments. Specifically, we explore filtering alignment (FA), reward-weighted regression (RWR), and conditional alignment (CA) to align language models with human preferences. By employing a loss function similar to that of supervised fine-tuning, our methods train more stably than PPO, with a simple machine learning system~(MLSys) and far fewer computing resources (around 9\%). Experimental results demonstrate that conditional alignment outperforms the other offline alignment methods and is comparable to PPO.
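To make the three offline strategies concrete, the following is a minimal sketch of how each one could turn a fixed set of reward-labeled samples into a supervised-style training signal. It is an illustration under stated assumptions, not the formulation used in this work: the `Sample` type, the threshold, the temperature `beta`, the exponential weighting, and the `<good>`/`<bad>` control tags are all hypothetical choices, and the per-sample negative log-likelihoods are assumed to come from an LM that is not shown here.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str
    response: str
    reward: float  # assumed to come from a pre-trained reward model (not shown)


def filtering_alignment(samples, threshold):
    """FA (assumed form): keep only samples whose reward reaches the threshold.

    The surviving samples would then be used for ordinary supervised fine-tuning.
    """
    return [s for s in samples if s.reward >= threshold]


def rwr_weights(samples, beta=1.0):
    """RWR (assumed form): weights proportional to exp(reward / beta), summing to 1.

    Subtracting the maximum reward first keeps the exponentials numerically stable.
    """
    max_reward = max(s.reward for s in samples)
    unnormalized = [math.exp((s.reward - max_reward) / beta) for s in samples]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def weighted_nll(nlls, weights):
    """Combine per-sample negative log-likelihoods into one weighted loss value."""
    return sum(w * nll for w, nll in zip(weights, nlls))


def conditional_prompt(sample, threshold):
    """CA (assumed form): prepend a hypothetical control tag reflecting the reward.

    All samples are kept; at inference time one would condition on the "<good>" tag.
    """
    tag = "<good>" if sample.reward >= threshold else "<bad>"
    return f"{tag} {sample.prompt}"


# Toy data for illustration only; rewards and NLL values are made up.
data = [
    Sample("Explain PPO.", "A vague answer.", reward=0.0),
    Sample("Explain PPO.", "A clear answer.", reward=1.0),
]
kept = filtering_alignment(data, threshold=0.5)
weights = rwr_weights(data, beta=1.0)
loss = weighted_nll([2.0, 1.0], weights)
prompts = [conditional_prompt(s, threshold=0.5) for s in data]
```

In all three cases the training objective remains a (possibly weighted) log-likelihood over fixed data, which is why such methods need no rollouts from the policy during training.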