Large language model (LLM) serving must meet strict, user-specific service-level objectives (SLOs) while minimizing computational cost under dynamic, multi-task workloads. Existing approaches either rely on static scheduling policies or target single-task settings, which limits their applicability to real-world deployments with heterogeneous requests, variable prompt lengths, and elastic scaling requirements. We present HFX, a production LLM serving system that jointly optimizes request scheduling and elastic scaling across model replicas to satisfy diverse SLOs. HFX introduces a \textbf{scheduler} that performs proactive budget estimation and prioritization to ensure SLO compliance for both new and in-flight requests. It also integrates a \textbf{scaler} that supports fast device-to-device (D2D) weight transfer, reducing cold-start latency. In addition, HFX supports both colocated and disaggregated prefill/decode deployments, allowing it to adapt to diverse workload patterns and cloud environments. In extensive experiments on multi-task workloads, HFX consistently outperforms state-of-the-art systems, improving SLO attainment by up to 4.44$\times$, reducing end-to-end latency by up to 65.82\%, and lowering NPU usage cost by up to 49.81\%. These results highlight the effectiveness of SLO-aware scheduling and scaling in practical LLM serving and show that HFX offers a robust framework for cost-efficient, SLO-compliant deployments.