Speculative decoding improves the inference efficiency of large language models (LLMs) by generating draft tokens with a small draft language model (DLM) and verifying them in batches with a large target language model (TLM). However, adaptive drafting on a mobile single-NPU-PIM system suffers from idle overhead under traditional operator-level synchronous execution and from wasted computation under asynchronous execution, because the draft length fluctuates. This paper introduces AHASD, a task-level asynchronous NPU-PIM heterogeneous architecture for speculative decoding on mobile devices. AHASD decouples the DLM and the TLM at the task level, so that drafting on the PIM runs in parallel with verification on a single NPU. Specifically, it incorporates Entropy-History-Aware Drafting Control and Time-Aware Pre-Verification Control to dynamically manage the execution of the adaptive drafting algorithm and the timing of pre-verification, thereby suppressing invalid drafting based on low-confidence drafts. Additionally, AHASD integrates Attention Algorithm Units and Gated Task Scheduling Units within LPDDR5-PIM to enable attention link localization and sub-microsecond task switching on the PIM side. Experimental results across different LLMs and adaptive drafting algorithms show that AHASD achieves up to 4.2$\times$ higher throughput and 5.6$\times$ higher energy efficiency than a GPU-only baseline, and 1.5$\times$ higher throughput and 1.24$\times$ higher energy efficiency than the state-of-the-art GPU+PIM baseline, with hardware overhead below 3\% of the DRAM area.
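To make the draft-then-verify loop concrete, the following is a minimal sketch of greedy speculative decoding in which drafting stops early once the draft model's next-token entropy exceeds a threshold. It is not AHASD's mechanism: it runs synchronously on one device, and the plain entropy threshold is only an assumed stand-in for confidence-based drafting control, whose details the abstract does not give. All names (`draft`, `verify`, `generate`, `toy_dlm`, `toy_tlm`) and the parameters `max_len` and `max_entropy` are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Dist = Sequence[float]               # next-token probabilities over a toy vocabulary
Model = Callable[[List[int]], Dist]  # maps a token prefix to a distribution


def entropy(p: Dist) -> float:
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def argmax(p: Dist) -> int:
    return max(range(len(p)), key=lambda i: p[i])


def draft(dlm: Model, prefix: List[int], max_len: int, max_entropy: float) -> List[int]:
    """Draft greedily; stop early when the draft model is uncertain (assumed rule)."""
    tokens: List[int] = []
    for _ in range(max_len):
        p = dlm(prefix + tokens)
        if entropy(p) > max_entropy:
            break
        tokens.append(argmax(p))
    return tokens


def verify(tlm: Model, prefix: List[int], drafted: List[int]) -> List[int]:
    """Keep drafted tokens while they match the target's greedy choice.

    On the first mismatch the target's token replaces the draft; if every
    drafted token is accepted, one extra target token is appended. A real
    system would score all positions in one batched target pass; this
    sketch calls the target model once per position for clarity.
    """
    accepted: List[int] = []
    for tok in drafted:
        target_tok = argmax(tlm(prefix + accepted))
        accepted.append(target_tok)
        if target_tok != tok:
            return accepted
    accepted.append(argmax(tlm(prefix + accepted)))
    return accepted


def generate(dlm: Model, tlm: Model, prompt: List[int], new_tokens: int,
             max_len: int = 4, max_entropy: float = 0.5) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < new_tokens:
        out += verify(tlm, out, draft(dlm, out, max_len, max_entropy))
    return out[: len(prompt) + new_tokens]


# Hypothetical toy models over a 4-token vocabulary: the target counts upward
# modulo 4; the draft model agrees but is maximally uncertain after token 2.
def toy_tlm(prefix: List[int]) -> Dist:
    nxt = (prefix[-1] + 1) % 4
    return [0.97 if i == nxt else 0.01 for i in range(4)]


def toy_dlm(prefix: List[int]) -> Dist:
    if prefix[-1] == 2:
        return [0.25, 0.25, 0.25, 0.25]
    return toy_tlm(prefix)
```

Because verification is greedy, the output matches what the target model alone would produce; the draft model only changes how many target steps are needed. In this synchronous form the draft model sits idle during verification and vice versa, which is the kind of overhead that a task-level asynchronous design aims to remove.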