With the rapid advancement of large language models (LLMs), efficiently serving LLM inference under limited GPU resources has become a critical challenge. Recently, a growing number of studies have explored applying serverless computing paradigms to LLM serving in order to maximize resource utilization. However, LLM inference workloads are highly diverse and modern GPU clusters are inherently heterogeneous, making it necessary to adjust deployment configurations online to adapt to the elastic and dynamic nature of serverless environments. Enabling such online reconfiguration is particularly challenging, however, due to the stateful nature of LLM inference and the massive size of model parameters. In this paper, we propose a dynamic pipeline reconfiguration approach that enables online adjustment of pipeline configurations while minimizing service downtime and performance degradation. Our method allows the system to select the optimal pipeline configuration in response to changing workloads. Experimental results on heterogeneous GPU platforms, including NVIDIA A100 and L40S, demonstrate that our migration mechanism incurs less than 50 ms of service downtime while introducing under 10% overhead on both time-to-first-token (TTFT) and time-per-output-token (TPOT).