As Large Language Models (LLMs) become increasingly accessible to end users, an ever-growing number of inference requests are initiated from edge devices and computed on centralized GPU clusters. However, the resulting exponential growth in computational workload places significant strain on data centers, while edge devices remain largely underutilized, leading to imbalanced workloads and resource inefficiency across the network. Integrating edge devices into the LLM inference process via speculative decoding helps balance the workload between the edge and the cloud while maintaining lossless prediction accuracy. In this paper, we identify and formalize two critical bottlenecks that limit the efficiency and scalability of distributed speculative LLM serving: Wasted Drafting Time and Verification Interference. To address these challenges, we propose WISP, an efficient and SLO-aware distributed LLM inference system consisting of an intelligent speculation controller, a verification time estimator, and a verification batch scheduler. These components collaboratively enhance drafting efficiency and optimize verification request scheduling on the server. Extensive numerical results show that WISP improves system capacity by up to 2.1x and 4.1x, and increases system goodput by up to 1.94x and 3.7x, compared to centralized serving and SLED, respectively.