As Large Language Models (LLMs) become increasingly accessible to end users, an ever-growing number of inference requests are initiated from edge devices and computed on centralized GPU clusters. However, the resulting exponential growth in computation workload is placing significant strain on data centers, while edge devices remain largely underutilized, leading to imbalanced workloads and resource inefficiency across the network. Integrating edge devices into the LLM inference process via speculative decoding helps balance the workload between the edge and the cloud, while maintaining lossless prediction accuracy. In this paper, we identify and formalize two critical bottlenecks that limit the efficiency and scalability of distributed speculative LLM serving: Wasted Drafting Time and Verification Interference. To address these challenges, we propose WISP, an efficient and SLO-aware distributed LLM inference system that consists of an intelligent speculation controller, a verification time estimator, and a verification batch scheduler. These components collaboratively enhance drafting efficiency and optimize verification request scheduling on the server. Extensive numerical results show that WISP improves system capacity by up to 2.1x and 4.1x, and increases system goodput by up to 1.94x and 3.7x, compared to centralized serving and SLED, respectively.