Diffusion large language models (dLLMs) accelerate generation by denoising multiple tokens in parallel, making them attractive for latency-sensitive mobile inference. However, repeated denoising introduces substantial computation on smartphones. Mobile neural processing units (NPUs) offer high-throughput dense matrix computation, but efficiently exploiting them remains challenging: token commitment shrinks per-block effective workloads, token revision complicates KV cache reuse, and limited NPU-visible address space incurs costly remapping and data transfer overheads. In this paper, we propose llada.cpp, the first NPU-aware inference framework for accelerating dLLMs on smartphones. llada.cpp aligns block-wise dLLM inference with the execution characteristics of mobile NPUs through three techniques. (1) Multi-Block Speculative Decoding fills the shrinking workload in late-stage current-block decoding with speculative future-block tokens. (2) Dual-Path Progressive Revision keeps committed tokens revisable until stable and refreshes unstable tokens through a CPU-side path without stalling dense NPU execution. (3) Swap-Optimized Memory Runtime compacts NPU-visible address layouts and overlaps data staging with NPU computation to reduce remapping and transfer overheads. We implement llada.cpp as an end-to-end framework and evaluate it across diverse hardware platforms and dLLM workloads. llada.cpp reduces LLaDA-8B generation latency by 17x-42x over the CPU baseline with prefix KV cache reuse, while preserving generation quality.
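To make the workload-shrinking problem behind technique (1) concrete, the following is a minimal toy sketch and not the llada.cpp implementation. It assumes a fixed per-step NPU token capacity and a simplified rule that commits a fixed number of current-block tokens per denoising step; real dLLMs typically commit tokens by confidence. The names `build_step_batch` and `utilization` are hypothetical and introduced only for illustration.

```python
from typing import List, Tuple


def build_step_batch(
    current_masked: List[int],
    future_masked: List[int],
    capacity: int,
) -> Tuple[List[int], List[int]]:
    """Pick token positions for one denoising step (hypothetical helper).

    Current-block positions are taken first; any spare capacity is
    filled with speculative positions from the future block.
    """
    cur = current_masked[:capacity]
    spare = capacity - len(cur)
    spec = future_masked[:spare]
    return cur, spec


def utilization(
    block_size: int, capacity: int, commits_per_step: int, fill: bool
) -> float:
    """Average fraction of the per-step capacity that carries tokens.

    Toy assumption: each step commits a fixed number of current-block
    tokens, so the current block's remaining workload shrinks steadily.
    """
    current = list(range(block_size))
    future = list(range(block_size, 2 * block_size))
    used = []
    while current:
        cur, spec = build_step_batch(current, future if fill else [], capacity)
        used.append(len(cur) + len(spec))
        current = current[commits_per_step:]  # toy commit rule
    return sum(used) / (capacity * len(used))


if __name__ == "__main__":
    print("without speculative fill:", utilization(8, 8, 2, fill=False))
    print("with speculative fill:   ", utilization(8, 8, 2, fill=True))
```

In this toy setting, late steps of the current block leave most of the dense batch empty unless future-block tokens are added, which is the gap the abstract says Multi-Block Speculative Decoding targets. Whether speculative results are accepted or discarded is outside the scope of this sketch.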