RLHFuse: Efficient RLHF Training for Large Language Models with Inter- and Intra-Stage Fusion

Reinforcement Learning from Human Feedback (RLHF) enhances the alignment between LLMs and human preferences. The RLHF workflow typically involves several models and tasks in a series of distinct stages. Existing RLHF training systems view each task as the smallest execution unit, thus overlooking opportunities for subtask-level optimization. Due to the intrinsic nature of RLHF training, i.e., data skewness in the generation stage and pipeline bubbles in the training stage, existing RLHF systems suffer from low GPU utilization in production deployments. RLHFuse breaks the traditional view of the RLHF workflow as a composition of individual tasks, splitting each task into finer-grained subtasks and performing stage fusion to improve GPU utilization. RLHFuse contains two key ideas. First, for generation and inference tasks, RLHFuse splits them into sample-level subtasks, enabling efficient inter-stage fusion to mitigate the original generation bottleneck dominated by long-tailed samples. Second, for training tasks, RLHFuse breaks them into micro-batch subtasks. Leveraging the intuition that one pipeline's execution can essentially be complemented by another pipeline's, RLHFuse performs intra-stage fusion to execute these subtasks concurrently in the training stage with a fused pipeline schedule, resulting in fewer pipeline bubbles. In addition, RLHFuse incorporates a series of system optimizations tailored to each stage of RLHF, making it efficient and scalable for our internal product use. We evaluate RLHFuse on various popular LLMs, and the results show that RLHFuse increases training throughput by up to 3.7x compared to existing state-of-the-art systems.
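The two sources of idle time named above can be made concrete with a toy model. The sketch below is an illustration under simplifying assumptions, not RLHFuse's scheduler: all samples are assumed to generate concurrently, a single hypothetical inference worker handles one sample at a time at a fixed cost, and the bubble formula is the standard one for a synchronous pipeline schedule with equal-cost micro-batches. The function names and the sample numbers are invented for illustration.

```python
def stage_level_makespan(gen_times, infer_time):
    """Task-level execution: inference starts only after the slowest
    sample has finished generation (toy model, single inference worker)."""
    return max(gen_times) + infer_time * len(gen_times)


def sample_level_makespan(gen_times, infer_time):
    """Sample-level execution: each sample is handed to the inference
    worker as soon as its own generation finishes."""
    clock = 0.0
    for done in sorted(gen_times):
        # The worker waits for the sample if it is not ready yet,
        # then spends infer_time on it.
        clock = max(clock, done) + infer_time
    return clock


def bubble_fraction(stages, micro_batches):
    """Idle fraction of a synchronous pipeline (GPipe/1F1B-style) with
    equal-cost micro-batches: (p - 1) / (m + p - 1)."""
    return (stages - 1) / (micro_batches + stages - 1)


# Hypothetical long-tailed batch: five short samples and one long one.
gen_times = [1, 1, 2, 2, 3, 10]
print(stage_level_makespan(gen_times, infer_time=1))   # 16
print(sample_level_makespan(gen_times, infer_time=1))  # 11.0
print(bubble_fraction(stages=4, micro_batches=8))      # 3/11, about 0.27
```

In this toy batch, the single long-tailed sample dominates the task-level schedule, while the sample-level schedule hides most of the inference work behind it. The bubble fraction shows why a pipeline with few micro-batches per stage leaves GPUs idle, which is the slack that a second, complementary pipeline could occupy.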
