Reinforcement Learning from Human Feedback (RLHF) has been widely applied in Large Language Model (LLM) post-training to align model outputs with human preferences. Recent models such as DeepSeek-R1 have also demonstrated RLHF's potential to improve LLM reasoning on complex tasks. In RL, inference and training coexist, creating dynamic resource demands throughout the workflow. Compared with traditional RL, RLHF further challenges training efficiency because of growing model sizes and resource consumption. Several RLHF frameworks aim to balance flexible abstraction with efficient execution; however, they rely on serverful infrastructures, which struggle to accommodate fine-grained resource variability. As a result, during synchronous RLHF training, idle time between and within RL components often causes overhead and wasted resources. To address these issues, we present RLHFless, the first scalable training framework for synchronous RLHF built on serverless computing environments. RLHFless adapts to dynamic resource demands throughout the RLHF pipeline, pre-computes shared prefixes to avoid repeated computation, and employs a cost-aware actor scaling strategy that accounts for response-length variation to find sweet spots with lower cost and higher speed. In addition, RLHFless assigns workloads efficiently to reduce intra-function imbalance and idle time. Experiments on both physical testbeds and a large-scale simulated cluster show that RLHFless achieves up to 1.35x speedup and 44.8% cost reduction over the state-of-the-art baseline.
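The abstract mentions pre-computing shared prefixes to avoid repeated computation. A minimal, purely illustrative sketch of this idea (not the RLHFless implementation; `encode`, `PrefixCache`, and the call-counting are hypothetical stand-ins for an expensive prefill pass) is:

```python
# Hypothetical sketch of shared-prefix pre-computation: when many rollouts
# share the same prompt prefix, encode the prefix once and reuse the result,
# so only each distinct suffix triggers new computation.

def encode(text):
    # Stand-in for an expensive prefill/forward pass; counts invocations.
    encode.calls += 1
    return [ord(c) for c in text]
encode.calls = 0

class PrefixCache:
    def __init__(self):
        self._cache = {}

    def encode_with_prefix(self, prefix, suffix):
        # Encode the shared prefix only once per distinct prefix.
        if prefix not in self._cache:
            self._cache[prefix] = encode(prefix)
        return self._cache[prefix] + encode(suffix)

cache = PrefixCache()
prompts = [("system: be helpful\n", f"question {i}") for i in range(4)]
outs = [cache.encode_with_prefix(p, s) for p, s in prompts]
# The shared prefix is encoded once; the four suffixes each cost one pass,
# so encode runs 5 times instead of 8.
```

In a real LLM serving stack the cached object would be the prefix's KV cache rather than a token list, but the saving has the same shape: one prefill for the shared prefix amortized across all rollouts.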
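The abstract also describes a cost-aware actor scaling strategy that accounts for response-length variation. The following is a hedged toy sketch of such a trade-off, under assumptions not taken from the paper: response lengths proxy for per-rollout work, actors are filled greedily longest-first, and the "sweet spot" is taken to minimize a cost-delay product (cost x iteration time). All names (`iteration_time`, `sweet_spot`) are hypothetical.

```python
# Toy model (assumed, not RLHFless's actual strategy): with skewed response
# lengths, adding actors shortens the makespan with diminishing returns, so
# some intermediate actor count balances cost against speed.

def iteration_time(response_lens, n_actors):
    """Greedy longest-first assignment; returns the makespan (max actor load)."""
    loads = [0.0] * n_actors
    for length in sorted(response_lens, reverse=True):
        i = loads.index(min(loads))  # place on the least-loaded actor
        loads[i] += length
    return max(loads)

def sweet_spot(response_lens, price_per_actor_second, max_actors):
    """Pick the actor count minimizing (cost x time), an assumed objective."""
    best = None
    for n in range(1, max_actors + 1):
        t = iteration_time(response_lens, n)
        cost = n * t * price_per_actor_second
        if best is None or cost * t < best[3]:
            best = (n, cost, t, cost * t)
    return best[:3]

# A batch with two long-tail responses dominating the makespan:
lens = [512, 480, 130, 120, 110, 100, 90, 80]
n, cost, t = sweet_spot(lens, price_per_actor_second=1.0, max_actors=8)
# Here three actors win: beyond that, the two long responses pin the
# makespan at 512 while cost keeps growing linearly in the actor count.
```

The real strategy would also have to account for serverless cold starts and the fact that response lengths are only known after generation, which is precisely why length variation makes static provisioning wasteful.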