The growing demand for on-device large language model (LLM) inference highlights the need for efficient mobile edge computing (MEC) solutions, especially in resource-constrained settings. Speculative decoding offers a promising approach: it partitions token generation between a lightweight draft model on the mobile device and a powerful target model on the edge server, but it suffers from communication overhead and asynchronous delays. This paper is the first to propose a unified framework that jointly optimizes user association and resource allocation (UARA) to support efficient parallel speculative decoding. We solve the UARA problem with a multi-agent deep reinforcement learning algorithm. To evaluate our approach under realistic conditions, we conduct experiments using the Sionna simulator. Results show that our method reduces end-to-end latency by up to 28.0% (23.7% on average) without compromising inference accuracy, enabling scalable, low-latency LLM services in MEC systems.
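For readers unfamiliar with the draft/target split the abstract refers to, the following is a minimal, self-contained sketch of one round of standard speculative decoding (draft model proposes, target model verifies, following the usual accept/reject rule). The toy distributions, vocabulary size `VOCAB`, and draft length `GAMMA` are illustrative assumptions that stand in for the on-device draft LLM and the edge-server target LLM; this is not the paper's implementation.

```python
# Minimal sketch of a speculative decoding round, assuming the standard
# acceptance rule: accept a drafted token with prob. min(1, p_target/q_draft);
# on rejection, resample from the normalized residual max(0, p - q).
# VOCAB, GAMMA, and the toy models below are illustrative placeholders.
import numpy as np

VOCAB = 8   # toy vocabulary size (assumption)
GAMMA = 4   # draft tokens proposed per verification round (assumption)
rng = np.random.default_rng(0)

def _probs(context, seed):
    """Deterministic toy next-token distribution for a given context.

    Stands in for an LLM forward pass; seed=1 plays the on-device draft
    model, seed=2 the edge-server target model."""
    h = hash((tuple(context), seed)) % (2**32)
    logits = np.random.default_rng(h).standard_normal(VOCAB)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def speculative_step(prefix):
    """One round: draft GAMMA tokens, then verify them against the target."""
    # 1) Draft model proposes GAMMA tokens autoregressively (on device).
    ctx, drafted, q_dists = list(prefix), [], []
    for _ in range(GAMMA):
        q = _probs(ctx, seed=1)
        tok = int(rng.choice(VOCAB, p=q))
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)
    # 2) Target model verifies each position (a single parallel pass in practice).
    ctx, accepted = list(prefix), []
    for tok, q in zip(drafted, q_dists):
        p = _probs(ctx, seed=2)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)      # draft token accepted
            ctx.append(tok)
        else:
            # Rejected: resample from the residual max(0, p - q), normalized.
            residual = np.maximum(p - q, 0.0)
            accepted.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return accepted
    # 3) All drafts accepted: take one bonus token from the target model.
    accepted.append(int(rng.choice(VOCAB, p=_probs(ctx, seed=2))))
    return accepted

if __name__ == "__main__":
    tokens = [0]                      # toy prompt
    while len(tokens) < 20:
        tokens.extend(speculative_step(tokens))
    print("generated:", tokens[:20])
```

In the MEC setting described above, each verification round corresponds to an uplink/downlink exchange between the mobile device and the edge server, which is the source of the communication overhead and asynchronous delays that the proposed UARA framework targets.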