TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs

With the growing demand for on-device LLM inference, edge SoCs increasingly integrate NPUs to improve performance and energy efficiency under tight power and thermal budgets. However, practical LLM deployment on current client NPUs remains difficult: widely used quantization formats such as AWQ do not map cleanly onto many existing NPU software stacks, which are often proprietary and expose limited low-level control. In this work, we present TileFuse, a close-to-metal mixed-precision kernel library for AMD XDNA2 NPUs that targets transformer linear layers in quantized LLM inference. TileFuse brings practical low-bit formats such as AWQ-style W4A16 and W8A16 directly onto XDNA2, rather than forcing the model to be reshaped around an NPU-specific quantization scheme. To do so, it co-designs weight layout, metadata placement, mixed-precision microkernels, and array-level dataflow. Specifically, it fuses unpacking, dequantization, and GEMM/GEMV execution into a single kernel flow, introduces an interleaved pre-tiling layout that supports GEMM dimensions up to 32K, and redesigns the GEMV dataflow to utilize the full 4x8 AIE array. In kernel-level evaluations, TileFuse improves performance by up to 121.6% for GEMM and 281% for GEMV over full-precision baselines, while delivering more than 2x performance and energy-efficiency gains over strong iGPU baselines on GEMM. In end-to-end LLM experiments on Ryzen AI laptops, TileFuse achieves up to 2.0x lower prefill latency with more than 64.6% lower energy consumption. Together, these results show that XDNA2 is a practical target for AWQ-style edge LLM inference and that native NPU support for off-the-shelf quantization can make NPUs substantially more usable in real client deployments.
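To make the fusion of unpacking, dequantization, and GEMV concrete, the following is a minimal reference sketch in plain Python of the arithmetic an AWQ-style W4A16 GEMV performs. It is not the TileFuse kernel and says nothing about its tiling or AIE dataflow. The names `pack_int4`, `fused_w4a16_gemv`, and `reference_gemv` are hypothetical, as are the low-nibble-first packing, the per-row group layout of scales and zero points, and the toy group size of 4 (real AWQ checkpoints commonly use larger groups such as 128). Activations are ordinary Python floats here, standing in for the 16-bit activations of a W4A16 scheme.

```python
from typing import List, Sequence

GROUP_SIZE = 4  # toy value chosen for illustration (assumption)


def pack_int4(values: Sequence[int]) -> bytes:
    """Pack unsigned 4-bit values two per byte, low nibble first (assumed layout)."""
    assert len(values) % 2 == 0
    assert all(0 <= v <= 15 for v in values)
    return bytes(values[i] | (values[i + 1] << 4) for i in range(0, len(values), 2))


def fused_w4a16_gemv(
    packed_rows: Sequence[bytes],
    scales: Sequence[Sequence[float]],
    zeros: Sequence[Sequence[int]],
    x: Sequence[float],
    group_size: int = GROUP_SIZE,
) -> List[float]:
    """Compute y = W x with W[r][c] = scale[r][g] * (q[r][c] - zero[r][g]), g = c // group_size.

    Unpacking, dequantization, and multiply-accumulate happen in one loop,
    so the dequantized weight matrix is never materialized.
    """
    y = []
    for r, row in enumerate(packed_rows):
        acc = 0.0
        for byte_idx, byte in enumerate(row):
            # Each byte holds two weights: low nibble first, then high nibble.
            for half, q in enumerate((byte & 0x0F, byte >> 4)):
                c = 2 * byte_idx + half
                g = c // group_size
                acc += scales[r][g] * (q - zeros[r][g]) * x[c]
        y.append(acc)
    return y


def reference_gemv(q_rows, scales, zeros, x, group_size: int = GROUP_SIZE) -> List[float]:
    """Unfused baseline: dequantize the whole matrix first, then multiply."""
    w = [
        [scales[r][c // group_size] * (q - zeros[r][c // group_size]) for c, q in enumerate(row)]
        for r, row in enumerate(q_rows)
    ]
    return [sum(wv * xv for wv, xv in zip(w_row, x)) for w_row in w]


# Toy 2x8 weight matrix with two groups of four columns per row.
q_rows = [[0, 1, 2, 3, 4, 5, 6, 7], [15, 14, 13, 12, 11, 10, 9, 8]]
scales = [[0.5, 0.25], [1.0, 2.0]]
zeros = [[2, 4], [8, 8]]
x = [1.0, -1.0, 2.0, 0.5, 1.0, 1.0, -2.0, 0.0]

packed_rows = [pack_int4(row) for row in q_rows]
y_fused = fused_w4a16_gemv(packed_rows, scales, zeros, x)
y_ref = reference_gemv(q_rows, scales, zeros, x)
print(y_fused)  # [-1.0, 19.0]
```

The fused and unfused paths give the same result on this toy input. What the fused form avoids is writing a full-precision copy of the weights back to memory, which is the cost that fusing unpacking and dequantization into the GEMM/GEMV kernel flow is meant to remove.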
