Recent advancements and widespread adoption of Large Language Models (LLMs) in both industry and academia have catalyzed significant demand for LLM serving. However, traditional cloud services incur high costs, while on-device inference alone is constrained by limited resources. Edge-cloud collaboration has emerged as a key research direction that combines the strengths of both paradigms, yet efficiently utilizing limited network bandwidth while fully leveraging and balancing the computational capabilities of edge devices and the cloud remains an open problem. To address these challenges, we propose the Pipelined Collaborative Speculative Decoding Framework (PicoSpec), a novel, general-purpose, and training-free speculative decoding framework for LLM edge-cloud collaborative inference. We design an asynchronous pipeline that concurrently executes a Small Language Model (SLM) on the edge device and an LLM in the cloud, resolving the mutual waiting problem inherent in vanilla speculative decoding in edge-cloud collaboration scenarios. Meanwhile, to mitigate the significant communication latency caused by transmitting vocabulary distributions, we introduce separate rejection sampling with sparse compression, which completes the rejection sampling with only a one-time cost of transmitting the compressed vocabulary. Experimental results demonstrate that our solution outperforms the baseline and existing methods, achieving up to a 2.9× speedup.
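To make the rejection-sampling step concrete, the following is a minimal sketch of the standard speculative-sampling acceptance test applied to a sparsified draft distribution. It is not PicoSpec's implementation: the abstract does not specify the compression scheme or how the sampling is split between edge and cloud, so top-k truncation with renormalization is assumed here as a stand-in, and the names `sparsify_top_k`, `sample`, and `verify_token` are hypothetical. The sketch also assumes the edge samples its draft token from the same sparsified distribution it transmits, which keeps the acceptance test consistent with the target distribution.

```python
import random


def sparsify_top_k(q, k):
    """Keep the k most probable draft tokens and renormalize.

    Assumption: top-k truncation stands in for the unspecified
    sparse compression. Returns a sparse dict {token_id: prob}.
    """
    top = sorted(range(len(q)), key=lambda i: q[i], reverse=True)[:k]
    mass = sum(q[i] for i in top)
    return {i: q[i] / mass for i in top}


def sample(dist, rng):
    """Draw a token id from a sparse {token_id: prob} distribution."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens])[0]


def verify_token(token, q_sparse, p, rng):
    """Standard speculative-sampling test for one drafted token.

    q_sparse: sparse draft distribution the token was sampled from.
    p: dense target distribution (list of probabilities).
    Returns (accepted, token): the draft token if accepted, otherwise
    a replacement drawn from the normalized residual max(p - q, 0).
    """
    if rng.random() < min(1.0, p[token] / q_sparse[token]):
        return True, token
    residual = [max(p[i] - q_sparse.get(i, 0.0), 0.0) for i in range(len(p))]
    return False, rng.choices(range(len(p)), weights=residual)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    q = [0.2, 0.2, 0.6]   # draft (SLM) distribution over a toy vocabulary
    p = [0.5, 0.3, 0.2]   # target (LLM) distribution
    q_sparse = sparsify_top_k(q, k=2)   # what the edge would transmit
    draft = sample(q_sparse, rng)       # token drafted on the edge
    print(verify_token(draft, q_sparse, p, rng))
```

Under these assumptions only `k` index-probability pairs per drafted token cross the network instead of a full vocabulary-sized vector, while the emitted token still follows the target distribution `p`.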