Recent advances in Large Language Models (LLMs) and their widespread adoption in both industry and academia have created significant demand for LLM serving. However, traditional cloud services incur high costs, while purely on-device inference is constrained by limited resources. Edge-cloud collaboration has emerged as a key research direction that combines the strengths of both paradigms, yet making efficient use of limited network bandwidth while fully leveraging and balancing the computational capabilities of edge devices and the cloud remains an open problem. To address these challenges, we propose the Pipelined Collaborative Speculative Decoding Framework (PicoSpec), a novel, general-purpose, and training-free speculative decoding framework for edge-cloud collaborative LLM inference. We design an asynchronous pipeline that concurrently executes a Small Language Model (SLM) on the edge device and an LLM in the cloud, resolving the mutual waiting problem inherent in vanilla speculative decoding in edge-cloud collaboration scenarios. Meanwhile, to mitigate the significant communication latency caused by transmitting vocabulary distributions, we introduce separate rejection sampling with sparse compression, which completes rejection sampling at only a one-time cost of transmitting the compressed vocabulary. Experimental results demonstrate that our solution outperforms the baseline and existing methods, achieving up to a 2.9× speedup.
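To make the rejection-sampling step concrete, the following is a minimal sketch of the standard speculative-decoding acceptance rule for a single token, combined with a simple top-k truncation of the draft distribution. It is not PicoSpec's separate rejection sampling, whose details are not given in this abstract. The function names (`sparsify_top_k`, `sample`, `verify`, `speculative_step`), the top-k scheme, and the assumption that the edge drafts from the truncated distribution are all illustrative choices made here.

```python
import random


def sparsify_top_k(probs, k):
    """Keep the k largest draft probabilities and renormalize.

    Returns a sparse {token_id: prob} dict. Top-k truncation is an
    assumption of this sketch, not necessarily the paper's compression.
    """
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}


def sample(dist, rng):
    """Draw one token id from a sparse {token_id: prob} distribution."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def verify(draft_token, q_sparse, p_dense, rng):
    """Standard speculative rejection step for one position.

    q_sparse: distribution the draft token was actually sampled from.
    p_dense:  target-model distribution over the full vocabulary.
    Returns (token, accepted).
    """
    q = q_sparse[draft_token]
    p = p_dense[draft_token]
    # Accept the draft token with probability min(1, p/q).
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    # Rejected: resample from the normalized residual max(0, p - q).
    residual = {}
    for i, p_i in enumerate(p_dense):
        r = p_i - q_sparse.get(i, 0.0)
        if r > 0.0:
            residual[i] = r
    return sample(residual, rng), False


def speculative_step(q_dense, p_dense, k, rng):
    """One draft-and-verify round for a single token (hypothetical helper)."""
    q_sparse = sparsify_top_k(q_dense, k)          # edge: compress
    draft = sample(q_sparse, rng)                  # edge: draft from compressed q
    return verify(draft, q_sparse, p_dense, rng)   # cloud: verify against p
```

Because the draft token is sampled from the same sparse distribution that is sent for verification, the emitted token still follows the target distribution, while only `k` (token, probability) pairs cross the network instead of a full vocabulary-sized vector. In a real deployment, `q_dense` and `p_dense` would come from the SLM and the LLM respectively.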