As Large Language Models (LLMs) scale, weight-only quantization (W4A16: 4-bit weights, 16-bit activations) becomes critical for reducing the memory footprint with minimal accuracy loss. However, efficient deployment on Huawei's Ascend 910 Neural Processing Unit (NPU) is challenging due to limited native mixed-precision support and the accelerator's decoupled compute architecture. To enable quantization on such an architecture, we present the first practical W4A16 matrix-multiplication kernel tailored for the Ascend 910 NPU. Our design leverages vector cores for on-the-fly INT4-to-FP16 dequantization, cube cores for high-throughput GEMM, and Split-K parallelization to mitigate memory latency. Performance evaluations across diverse matrix shapes and batch sizes show that our method outperforms data-parallel approaches when K >> N, a typical scenario in LLM decoding. Specifically, our method achieves speedups ranging from 1.01x to 1.74x. In addition, our profiling reveals that the primary bottleneck is not the dequantization computation itself but the extra global-memory traffic for the weights, which limits W4A16 to a maximum speedup of 1.48x over native FP16xFP16 matrix multiplication in PyTorch. In the long run, our method lays a solid foundation and offers practical insights for the efficient deployment of quantized large language models on diverse domain-specific accelerators.
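The W4A16 pipeline described above can be sketched as a NumPy reference model: unpack two signed INT4 values per byte, dequantize them to FP16 with per-column scales, and accumulate the GEMM over K-dimension chunks in FP32 (the Split-K idea). This is a minimal illustrative sketch, not the actual Ascend kernel; the function names, the packing layout (two INT4 values packed along K), and the per-column scaling scheme are all assumptions for demonstration.

```python
import numpy as np

def dequant_int4_to_fp16(packed, scales):
    """Unpack signed INT4 weights (two per uint8 byte, along K) and scale to FP16.

    packed: uint8 array of shape (K // 2, N); low nibble = even K row, high nibble = odd K row.
    scales: FP16 per-column scales of shape (1, N).
    """
    lo = (packed & 0x0F).astype(np.int8)   # even K rows, still unsigned 0..15
    hi = (packed >> 4).astype(np.int8)     # odd K rows, still unsigned 0..15
    # Sign-extend the 4-bit values to the range [-8, 7].
    lo = np.where(lo >= 8, lo - 16, lo)
    hi = np.where(hi >= 8, hi - 16, hi)
    K, N = 2 * packed.shape[0], packed.shape[1]
    w = np.empty((K, N), dtype=np.float16)
    w[0::2, :] = lo
    w[1::2, :] = hi
    return w * scales  # per-column dequantization scale

def w4a16_gemm_splitk(x, packed, scales, splits=4):
    """W4A16 GEMM: FP16 activations x (M, K) times dequantized INT4 weights (K, N).

    The K dimension is split into `splits` chunks whose partial products are
    accumulated in FP32, mimicking Split-K parallelization (here sequentially).
    """
    w = dequant_int4_to_fp16(packed, scales)
    K, N = w.shape
    chunk = K // splits  # assumes splits divides K evenly
    acc = np.zeros((x.shape[0], N), dtype=np.float32)
    for s in range(splits):
        ks = slice(s * chunk, (s + 1) * chunk)
        acc += x[:, ks].astype(np.float32) @ w[ks, :].astype(np.float32)
    return acc.astype(np.float16)
```

On the real NPU the dequantization runs on the vector cores while the chunked matmuls run on the cube cores; the sequential loop above only models the arithmetic, not the parallel execution or the inter-core data movement that the profiling identifies as the bottleneck.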