Multi-SPIN: Multi-Access Speculative Inference for Cooperative Token Generation at the Edge

Speculative inference (SPIN) was originally developed as an efficient architecture for accelerating inference with large language models (LLMs). In this work, we propose its distributed deployment to enable cooperative token generation in a multiuser edge system, where its advantage is to balance computational loads between resource-constrained devices and servers. The resulting architecture, termed Multi-access SPIN (Multi-SPIN), uses on-device small language models to generate and upload candidate token drafts, while an edge server runs the LLM to verify them in parallel batches. Given the severe heterogeneity in users' computation and communication capabilities, the draft length emerges as a critical control variable: it influences both node-level computation loads and multi-access latency, and thereby governs the sum token goodput. Consequently, assuming frequency-division multiple access, we investigate the problem of multi-access draft control, a joint optimization of draft-length control and bandwidth allocation that maximizes the sum token goodput. We examine two cases: (1) homogeneous draft lengths across users, which facilitate server-side batching, and (2) heterogeneous draft lengths, which introduce a new dimension for goodput enhancement. By developing decomposition methods, we reduce these complex optimizations to tractable sub-problems, from which efficient draft control algorithms are derived in closed form. Our analysis shows that, in the homogeneous case, the optimal bandwidth allocation compensates users with weaker computation-and-communication capabilities because batching requires synchronization, whereas in the heterogeneous case, where this requirement is relaxed, it rewards users with higher acceptance rates. Experiments using Llama-2 and Qwen3.5 model pairs across diverse tasks demonstrate that Multi-SPIN improves goodput by up to 88% over heterogeneity-agnostic baselines.

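To make the homogeneous case concrete, the following is a minimal sketch and not the closed-form algorithm derived in this work. It assumes the standard i.i.d. acceptance model, under which a draft of length L with per-token acceptance rate a yields (1 - a^(L+1)) / (1 - a) expected tokens per round; a constant uplink spectral efficiency per user, so that rate scales linearly with allocated bandwidth; and a server verification time that does not depend on L. All names and parameters (`User`, `bits_per_token`, `verify_time`, and so on) are hypothetical. For each candidate draft length, bandwidth is split by bisection so that all drafts reach the server at the same time, which is what gives the weaker user the larger share.

```python
from dataclasses import dataclass


@dataclass
class User:
    draft_time: float    # seconds to draft one token on the device (assumed)
    spectral_eff: float  # uplink bits/s/Hz, assumed constant (assumed)
    accept_rate: float   # per-token acceptance probability, assumed i.i.d.


def expected_tokens(accept_rate: float, draft_len: int) -> float:
    """Expected tokens per round: accepted draft prefix plus one verifier token."""
    if accept_rate >= 1.0:
        return draft_len + 1.0
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def min_sync_latency(users, draft_len, bandwidth_hz, bits_per_token, iters=60):
    """Bisect on the common deadline T at which every draft reaches the server.

    To meet deadline T, user k needs bandwidth
        payload / (spectral_eff_k * (T - draft_len * draft_time_k)),
    and T is feasible when these demands sum to at most bandwidth_hz.
    Returns the smallest feasible T found and the per-user allocation.
    """
    payload = draft_len * bits_per_token
    compute = [draft_len * u.draft_time for u in users]

    def demand(deadline):
        return sum(payload / (u.spectral_eff * (deadline - c))
                   for u, c in zip(users, compute))

    lo = max(compute)  # infeasible bound: slowest user would have no upload time
    hi = lo + 1.0
    while demand(hi) > bandwidth_hz:  # grow hi until it is feasible
        hi = lo + 2.0 * (hi - lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if demand(mid) > bandwidth_hz:
            lo = mid
        else:
            hi = mid
    alloc = [payload / (u.spectral_eff * (hi - c)) for u, c in zip(users, compute)]
    return hi, alloc


def sum_goodput(users, draft_len, bandwidth_hz, bits_per_token, verify_time):
    """Sum of expected tokens per round divided by the shared round duration."""
    deadline, _ = min_sync_latency(users, draft_len, bandwidth_hz, bits_per_token)
    tokens = sum(expected_tokens(u.accept_rate, draft_len) for u in users)
    return tokens / (deadline + verify_time)


def best_homogeneous_length(users, bandwidth_hz, bits_per_token, verify_time,
                            max_len=16):
    """Brute-force search over a common draft length (illustration only)."""
    return max(
        range(1, max_len + 1),
        key=lambda n: sum_goodput(users, n, bandwidth_hz, bits_per_token, verify_time),
    )


# Hypothetical two-user example: a strong device and a weak one.
users = [
    User(draft_time=0.010, spectral_eff=4.0, accept_rate=0.8),
    User(draft_time=0.030, spectral_eff=1.0, accept_rate=0.6),
]
best_len = best_homogeneous_length(users, bandwidth_hz=1e6, bits_per_token=1e5,
                                   verify_time=0.05)
deadline, alloc = min_sync_latency(users, best_len, 1e6, 1e5)
print(best_len, deadline, alloc)
```

In this toy setting the second user, who drafts more slowly and has the poorer channel, receives the larger bandwidth share, consistent with the compensation behavior described for the homogeneous case. The heterogeneous case would need a per-user draft length and is not covered by this sketch.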