We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, a paradigm widely reported in the recent large language model (LLM) literature to outperform its offline counterpart by a large margin. However, existing open-source RLHF projects remain largely confined to the offline learning setting. In this technical report, we aim to fill this gap and provide a detailed, easily reproducible recipe for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models from a diverse set of open-source datasets and use these proxy preference models to approximate human feedback. We then discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM, SFR-Iterative-DPO-LLaMA-3-8B-R, achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as on academic benchmarks such as HumanEval and TruthfulQA. We show that supervised fine-tuning (SFT) followed by iterative RLHF can achieve state-of-the-art performance with fully open-source datasets. We have also made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available. Please refer to https://github.com/RLHFlow/RLHF-Reward-Modeling and https://github.com/RLHFlow/Online-RLHF for more detailed information.
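As background for the recipe above: each iteration of iterative DPO samples responses from the current policy, labels chosen/rejected pairs with the proxy preference model, and then takes a DPO update on those pairs. The following is a minimal sketch of the standard per-pair DPO objective only; the variable names and numeric values are illustrative placeholders of ours, not the report's actual implementation.

```python
import math

def dpo_loss(beta, logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Standard DPO loss for a single preference pair.

    logp_w / logp_l   : current-policy log-probs of the chosen (w) and
                        rejected (l) responses.
    ref_logp_w / _l   : the same log-probs under the frozen reference
                        (SFT) policy.
    beta              : strength of the implicit KL penalty.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), computed stably as softplus(-margin).
    return math.log1p(math.exp(-margin))

# If the policy favors the chosen response more than the reference does,
# the margin is positive and the loss drops below log(2) ~= 0.693.
print(round(dpo_loss(0.1, logp_w=-5.0, logp_l=-9.0,
                     ref_logp_w=-6.0, ref_logp_l=-6.0), 3))  # → 0.513
```

The "online" aspect of the recipe is that the pairs fed to this loss are regenerated from the latest policy at every iteration, rather than drawn from a fixed offline dataset.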