As large language models (LLMs) continue to grow following scaling laws, reinforcement learning from human feedback (RLHF) has gained significant attention for its outstanding performance. However, unlike pretraining or fine-tuning a single model, scaling RLHF for training LLMs poses coordination challenges across four models (the actor, critic, reward, and reference models). We present OpenRLHF, an open-source framework that enables efficient RLHF scaling. Unlike existing RLHF frameworks that co-locate the four models on the same GPUs, OpenRLHF redesigns scheduling for models beyond 70B parameters using Ray, vLLM, and DeepSpeed, improving resource utilization and supporting diverse training approaches. Integrating seamlessly with Hugging Face, OpenRLHF provides an out-of-the-box solution with optimized algorithms and launch scripts, ensuring user-friendliness. OpenRLHF implements RLHF, DPO, rejection sampling, and other alignment techniques. OpenRLHF has empowered state-of-the-art LLM development; the code is available at https://github.com/OpenLLMAI/OpenRLHF.
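To make the scheduling idea concrete, below is a minimal sketch (our illustration, not OpenRLHF's actual implementation) of how Ray can place each of the four RLHF models on its own GPU via dedicated actors instead of co-locating them; the `ModelWorker` class and its `role` parameter are hypothetical names introduced here for exposition.

```python
# Minimal sketch: schedule the four RLHF models on separate GPUs with Ray,
# so that generation and training workloads need not share the same device.
# This is an illustration of the scheduling concept, not OpenRLHF's code.
import ray

ray.init()  # assumes a machine (or cluster) with at least 4 visible GPUs

@ray.remote(num_gpus=1)
class ModelWorker:
    """Hypothetical worker holding one RLHF model on its own GPU."""
    def __init__(self, role: str):
        self.role = role  # "actor", "critic", "reward", or "reference"
        # In a real system, the model would be loaded here, e.g. with
        # DeepSpeed for training roles or vLLM for fast actor generation.

    def describe(self) -> str:
        # ray.get_gpu_ids() reports the GPUs Ray assigned to this actor.
        return f"{self.role} model assigned GPUs {ray.get_gpu_ids()}"

# One Ray actor per model, each reserving a full GPU.
workers = {role: ModelWorker.remote(role)
           for role in ("actor", "critic", "reward", "reference")}
print(ray.get([w.describe.remote() for w in workers.values()]))
```

The design choice this sketch reflects is that Ray's resource-aware placement lets each model claim dedicated GPUs, which is what allows a framework to scale past the point where all four models fit on one device.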