Enhancing on-device large language models (LLMs) with contextual information from local data enables personalized and task-aware generation, powering use cases such as intelligent assistants and UI agents. While recent advances in neural processors have substantially improved prefill efficiency on mobile devices, the token-by-token generation process still suffers from high latency and low hardware utilization due to its inherently memory-bound nature. This work presents sd.npu, a mobile inference framework that integrates speculative decoding with dynamic hardware scheduling to accelerate context-aware text generation on mobile devices. The framework introduces three synergistic components: (1) adaptive execution scheduling, which dynamically balances compute graphs between the prefill and decoding phases; (2) context-aligned drafting, which improves speculative efficiency through lightweight online calibration to the current task; and (3) hardware-efficient draft extension, which reuses and expands intermediate sequences to increase processing parallelism and reduce verification cost. Experiments on multiple smartphones and representative workloads show consistent improvements of up to 3.8x in generation speed and 4.7x in energy efficiency over existing mobile inference solutions. Component-level analysis further validates the contribution of each optimization.
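To make the draft-then-verify structure referenced above concrete, the following is a minimal greedy sketch of the speculative-decoding loop, under stated assumptions: the `draft_next_token` and `target_verify` callables are hypothetical stand-ins for the draft and target models, and none of sd.npu's scheduling, context-aligned drafting, or NPU execution graphs are represented.

```python
# Minimal sketch of greedy speculative decoding: a small draft model proposes
# k tokens autoregressively, and the target model verifies them in one parallel
# pass, converting k memory-bound decode steps into a single compute-friendly pass.
# The model interfaces here are hypothetical, not part of the sd.npu framework.

def speculative_decode(draft_next_token, target_verify, prompt, max_new=32, k=4):
    """draft_next_token(seq) -> draft model's next token for seq.
    target_verify(seq, draft) -> target model's greedy token at each draft position."""
    seq = list(prompt)
    generated = 0
    while generated < max_new:
        # 1. Draft phase: the lightweight model proposes k tokens one by one.
        draft, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next_token(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Verify phase: the target model scores all k draft positions at once.
        target = target_verify(seq, draft)
        # 3. Accept the longest matching prefix, then take the target's own token
        #    at the first mismatch so every iteration emits at least one token.
        accepted = 0
        for d, t in zip(draft, target):
            if d != t:
                break
            accepted += 1
        new_tokens = draft[:accepted]
        if accepted < k:
            new_tokens.append(target[accepted])
        seq.extend(new_tokens)
        generated += len(new_tokens)
    return seq

# Toy usage with stand-in "models" (next token = previous token + 1, mod 100):
if __name__ == "__main__":
    drafter = lambda s: (s[-1] + 1) % 100
    verifier = lambda s, d: [((s + d[:i])[-1] + 1) % 100 for i in range(len(d))]
    print(speculative_decode(drafter, verifier, [1, 2, 3], max_new=8, k=4))
```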