LLM deployment on resource-constrained edge devices faces severe latency constraints, particularly in real-time applications where delayed responses can compromise safety or usability. Among the many approaches to mitigating the inefficiency of sequential token-by-token generation, Speculative Decoding (SD) has emerged as a promising technique. However, SD at the edge is hindered by two major challenges: (1) integrating SD into a compiler-based workflow without sacrificing performance or programmability, and (2) exploiting the heterogeneous compute resources of modern SoCs through carefully designed partitioning strategies. This work addresses these challenges with an analytical cost model that explores heterogeneous hardware configurations and guides coarse-grained partitioning of LLM subgraphs, particularly for the short input sequence lengths typical of edge workloads. The cost model predicts when speculative sampling and heterogeneous execution are jointly beneficial; it is validated on an edge device featuring a hexa-core Cortex-A CPU and a Mali GPU, where it achieves speedups of up to 1.68$\times$ on translation tasks, closely matching analytic expectations.
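To make the cost-model idea concrete, the sketch below estimates the expected speedup of speculative decoding from three parameters: the draft-token acceptance rate, the number of tokens drafted per round, and the relative cost of a draft step versus a target step. This is a minimal illustration in the style of standard SD analyses, not the paper's actual model; the function name and parameter choices are the author's own assumptions, and the model here ignores heterogeneous-partitioning effects entirely.

```python
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Illustrative SD speedup estimate (NOT the paper's cost model).

    alpha: probability that a drafted token is accepted by the target model
    gamma: number of tokens drafted per speculation round
    c:     cost of one draft-model step relative to one target-model step
    """
    # Expected number of target tokens accepted per verification pass,
    # assuming independent acceptances: sum of alpha^0 .. alpha^gamma.
    if alpha == 1.0:
        tokens_per_round = gamma + 1
    else:
        tokens_per_round = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    # One round costs gamma draft steps plus one target verification pass,
    # measured in units of a single target-model step.
    cost_per_round = gamma * c + 1
    return tokens_per_round / cost_per_round


if __name__ == "__main__":
    # With a cheap, fairly accurate draft model, SD pays off...
    print(expected_speedup(alpha=0.8, gamma=4, c=0.1))  # > 1: beneficial
    # ...but with a useless draft model it is a net slowdown.
    print(expected_speedup(alpha=0.0, gamma=4, c=0.1))  # < 1: harmful
```

A cost model of this shape directly exposes the crossover the abstract refers to: speculative sampling helps only when the acceptance rate and draft/target cost ratio land on the favorable side of the break-even point.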