Federated inference improves LLM performance in edge computing by taking a weighted average of the predictions of distributed models. However, autoregressive LLM inference requires a full-model forward pass on every worker for each generated token, which severely limits decoding throughput. Distributed deployment adds a communication bottleneck: each worker must transmit a full token probability distribution for every draft token, and this transmission dominates end-to-end latency. To address these challenges, we introduce speculative decoding to enable parallel LLM processing and propose a top-K compressed transmission scheme with two server-side reconstruction strategies. We theoretically analyze the robustness of our method in terms of local reconstruction error, aggregation bias, and acceptance-rate bias, and derive corresponding bounds. Experiments demonstrate that our scheme achieves high generation fidelity while significantly reducing communication overhead.
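To make the top-K transmission idea concrete, the following is a minimal sketch rather than the method as specified by the authors. The abstract does not name the two server-side reconstruction strategies, so the two shown here (renormalizing the kept mass, and spreading the leftover mass uniformly over unsent tokens) are illustrative assumptions, as are all function names and the choice to send the leftover mass alongside the top-K entries. Speculative-decoding acceptance is not modeled.

```python
from typing import Dict, List, Tuple

# Hypothetical message format: (top-K token id -> probability, leftover mass)
Sparse = Tuple[Dict[int, float], float]


def compress_top_k(probs: List[float], k: int) -> Sparse:
    """Worker side: keep the K largest probabilities and the leftover mass."""
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and the vocabulary size")
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = {i: probs[i] for i in top}
    return kept, max(0.0, 1.0 - sum(kept.values()))


def reconstruct_renormalize(msg: Sparse, vocab_size: int) -> List[float]:
    """Assumed strategy A: zero outside the top-K, rescale kept mass to 1."""
    kept, _ = msg
    total = sum(kept.values())
    return [kept.get(i, 0.0) / total for i in range(vocab_size)]


def reconstruct_uniform_tail(msg: Sparse, vocab_size: int) -> List[float]:
    """Assumed strategy B: spread the leftover mass evenly over unsent tokens."""
    kept, residual = msg
    n_tail = vocab_size - len(kept)
    fill = residual / n_tail if n_tail else 0.0
    return [kept.get(i, fill) for i in range(vocab_size)]


def aggregate(dists: List[List[float]], weights: List[float]) -> List[float]:
    """Server side: weighted average of the reconstructed distributions."""
    w_sum = sum(weights)
    return [
        sum(w * d[i] for w, d in zip(weights, dists)) / w_sum
        for i in range(len(dists[0]))
    ]


if __name__ == "__main__":
    # Two workers, a toy vocabulary of 5 tokens, K = 2.
    worker_probs = [
        [0.50, 0.30, 0.10, 0.06, 0.04],
        [0.40, 0.10, 0.35, 0.10, 0.05],
    ]
    msgs = [compress_top_k(p, k=2) for p in worker_probs]
    rebuilt = [reconstruct_uniform_tail(m, vocab_size=5) for m in msgs]
    print(aggregate(rebuilt, weights=[0.6, 0.4]))
```

In this sketch each worker sends K index-probability pairs plus one scalar instead of a full vocabulary-sized vector, which is where the communication saving would come from. The gap between a reconstructed distribution and the original one corresponds to the local reconstruction error that the abstract says is bounded.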