Speculative decoding improves the inference efficiency of large language models (LLMs) by generating draft tokens with a small draft language model (DLM) and verifying them in batches with a large target language model (TLM). However, adaptive drafting inference on a mobile system with a single NPU and PIM suffers from idle overhead under traditional operator-level synchronous execution, and from wasted computation under asynchronous execution because the draft length fluctuates. This paper introduces AHASD, a task-level asynchronous mobile NPU-PIM heterogeneous architecture for speculative decoding. AHASD decouples the DLM and the TLM at the task level, so that drafting on the PIM runs in parallel with verification on a single NPU. It incorporates Entropy-History-Aware Drafting Control and Time-Aware Pre-Verification Control to dynamically manage the execution of the adaptive drafting algorithm and the timing of pre-verification, suppressing invalid drafting that builds on low-confidence drafts. Additionally, AHASD integrates Attention Algorithm Units and Gated Task Scheduling Units within LPDDR5-PIM to enable attention link localization and sub-microsecond task switching on the PIM side. Experimental results across different LLMs and adaptive drafting algorithms show that AHASD achieves up to 4.2$\times$ higher throughput and 5.6$\times$ higher energy efficiency than a GPU-only baseline, and 1.5$\times$ higher throughput and 1.24$\times$ higher energy efficiency than the state-of-the-art GPU+PIM baseline, with hardware overhead below 3% of the DRAM area.
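The draft-then-verify loop that speculative decoding relies on can be illustrated with a minimal sketch. This is not AHASD's implementation: it runs sequentially on one device, uses greedy acceptance, and omits the entropy-based and time-aware controls. The names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical, and the two "models" are toy functions chosen so that the draft occasionally disagrees with the target.

```python
from typing import Callable, List

Token = int
# Assumed interface: a model maps a token context to its greedy next token.
Model = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_model: Model,
                     target_model: Model, draft_len: int) -> List[Token]:
    """One draft-then-verify round with greedy acceptance."""
    # Draft phase: the small model proposes draft_len tokens autoregressively.
    draft: List[Token] = []
    ctx = list(prefix)
    for _ in range(draft_len):
        tok = draft_model(ctx)
        draft.append(tok)
        ctx.append(tok)

    # Verify phase: the target checks every drafted position.
    # (A real system scores all positions in one batched pass.)
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in draft:
        expected = target_model(ctx)
        if tok != expected:
            # First mismatch: keep the target's token, discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # Every draft token was accepted; the target adds one bonus token.
    accepted.append(target_model(ctx))
    return accepted


def generate(prefix: List[Token], draft_model: Model, target_model: Model,
             draft_len: int, max_new: int) -> List[Token]:
    """Repeat speculative rounds until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_model, target_model, draft_len))
    return out[:len(prefix) + max_new]


def toy_target(ctx: List[Token]) -> Token:
    # Toy target model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Toy draft model: agrees with the target except after token 3.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([0], toy_draft, toy_target, draft_len=4))
    print(generate([0], toy_draft, toy_target, draft_len=4, max_new=12))
```

With greedy acceptance the output matches target-only decoding token for token, and the draft model only changes how many tokens each verification round yields. Tokens drafted after a mismatch are discarded, which is the wasted computation that grows when the draft length fluctuates.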