This paper presents PipeBoost, a low-latency LLM serving system for multi-GPU (serverless) clusters that can rapidly launch inference services in response to bursty requests without preemptively over-provisioning GPUs. Many LLM inference tasks share the same base model (e.g., LoRA adapters fine-tuned from a common base). To exploit this, PipeBoost introduces fault-tolerant pipeline parallelism across both the model-loading and inference stages, maximizing aggregate PCIe bandwidth and parallel computation across GPUs to generate the first token faster. PipeBoost also introduces recovery techniques that keep inference services uninterrupted by leveraging the shared state available across multiple GPUs. Experimental results show that, compared with state-of-the-art low-latency LLM serving systems, PipeBoost reduces inference latency by 31% to 49.8%. For certain models (e.g., OPT-1.3B), PipeBoost achieves cold-start latencies of a few hundred microseconds.