Deploying large language models (LLMs) on mobile devices increasingly relies on heterogeneous execution, yet no prior study has systematically characterized NPU effectiveness at the operator and pipeline levels. We present the first stage-aware, multi-level benchmarking study of mobile LLM inference on a CPU-NPU heterogeneous SoC. We introduce an OPMASK-based controlled pipeline decomposition methodology that isolates the communication, quantization, and computation overheads within the NPU execution path. Our results reveal a counter-intuitive stage-level performance reversal: CPUs outperform NPUs in the compute-intensive Prefill stage (by up to 1.6x), while NPUs provide only limited acceleration in the memory-bound Decode stage (1.05x to 1.2x). We further show that scheduling overhead and cross-backend fallback reduce the practical benefits of NPU offloading. In terms of energy, increasing the degree of NPU offloading raises energy consumption by up to 51%. Based on these findings, we derive design guidelines for NPU architects targeting on-device LLM inference.