Distributed inference is a promising approach for enabling large language model (LLM) inference at the network edge. It distributes the inference process across multiple devices so that LLMs can fit into device memory. Recent pipeline-based approaches can parallelize communication and computation, which helps reduce inference latency. However, this benefit diminishes when inference requests at the network edge are sparse, leaving the pipeline at low utilization. To enable efficient distributed LLM inference at the edge, we propose \textbf{FlowSpec}, a pipeline-parallel tree-based speculative decoding framework. FlowSpec incorporates three key mechanisms to improve decoding efficiency: 1) score-based step-wise verification, which prioritizes more important draft tokens so that tokens are accepted earlier; 2) efficient draft management, which prunes invalid tokens while maintaining correct causal relationships during verification; and 3) dynamic draft expansion strategies, which supply high-quality speculative inputs. These techniques work in concert to enhance both pipeline utilization and speculative efficiency. We evaluate FlowSpec against other baselines on a real-world testbed. Experimental results demonstrate that our proposed framework significantly improves inference speed across diverse models and configurations, achieving speedup ratios of 1.37$\times$-1.73$\times$ over the baselines. Our code is publicly available at \href{https://github.com/Leosang-lx/FlowSpec#}{https://github.com/Leosang-lx/FlowSpec\#}.