As large language models (LLMs) continue to grow in line with scaling laws, reinforcement learning from human feedback (RLHF) has gained significant attention due to its outstanding performance. However, unlike pretraining or fine-tuning a single model, scaling RLHF to large language models poses coordination challenges across four models. We present OpenRLHF, an open-source framework that enables efficient RLHF scaling. Unlike existing RLHF frameworks that co-locate the four models on the same GPUs, OpenRLHF redesigns scheduling for models beyond 70B parameters using Ray, vLLM, and DeepSpeed, improving resource utilization and supporting diverse training approaches. OpenRLHF integrates seamlessly with Hugging Face and provides an out-of-the-box solution with optimized algorithms and launch scripts, making it easy to use. It implements RLHF, DPO, rejection sampling, and other alignment techniques. To support state-of-the-art LLM development, OpenRLHF's code is available at https://github.com/OpenLLMAI/OpenRLHF.