The rapid scaling of large language models (LLMs) presents significant challenges for their deployment and inference, particularly on resource-constrained specialized AI accelerators such as Huawei's Ascend Neural Processing Units (NPUs), where weight data transfer has become a critical performance bottleneck. While lossless compression can preserve model accuracy and reduce data volume, existing lossless compression algorithms exhibit extremely low throughput when ported to the Ascend NPU architecture. In this paper, we propose ENEC, a novel lossless compression method tailored to AI model weights and optimized for Ascend NPUs. ENEC adopts a block-based fixed-length encoding scheme and incorporates a series of NPU-specific optimizations: bit-width quantization with hierarchical halving bit-packing, vectorized branch-free integer transformation, and dependency-decoupled intra-segment scan for efficient prefix-sum computation. Experimental results demonstrate that ENEC outperforms existing state-of-the-art NPU compressors in both compression ratio and throughput. Compared to leading GPU solutions, ENEC achieves 3.43X higher throughput than DietGPU and a 1.12X better compression ratio than nvCOMP. By reducing weight transmission overhead, ENEC significantly improves end-to-end inference performance, achieving up to a 6.3X speedup. On Ascend NPUs, ENEC is the first open-source lossless compression algorithm for model weights that achieves performance comparable to state-of-the-art GPU compressors, offering an effective solution for deploying large-scale AI models.
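To make the ideas named in the abstract more concrete, the following is a minimal CPU-side sketch in Python of block-based fixed-length encoding, a branch-free integer transformation, and prefix-sum offsets. It is not ENEC's implementation: the abstract does not specify the transform, so zigzag encoding is used here as a stand-in, and the block size, function names, and layout (`BLOCK`, `compress`, `decompress_block`) are assumptions made purely for illustration.

```python
BLOCK = 4  # hypothetical block size, chosen only to keep the example small


def zigzag(v: int, bits: int = 16) -> int:
    """Branch-free map from signed to unsigned; small magnitudes get small codes."""
    return ((v << 1) ^ (v >> (bits - 1))) & ((1 << bits) - 1)


def unzigzag(u: int) -> int:
    """Inverse of zigzag, also without branches."""
    return (u >> 1) ^ -(u & 1)


def compress(values, block=BLOCK):
    """Pack each block with one fixed bit width (the widest code in that block)."""
    codes = [zigzag(v) for v in values]
    widths, payloads = [], []
    for i in range(0, len(codes), block):
        chunk = codes[i:i + block]
        w = max(c.bit_length() for c in chunk)
        packed = 0
        for j, c in enumerate(chunk):
            packed |= c << (j * w)  # every code in the block occupies exactly w bits
        widths.append(w)
        payloads.append(packed.to_bytes((w * len(chunk) + 7) // 8, "little"))
    # Exclusive prefix sum over block sizes: the byte offset of each block.
    offsets, total = [], 0
    for p in payloads:
        offsets.append(total)
        total += len(p)
    return widths, offsets, b"".join(payloads)


def decompress_block(widths, offsets, data, k, count):
    """Decode block k on its own, using only its width and offset."""
    w = widths[k]
    nbytes = (w * count + 7) // 8
    packed = int.from_bytes(data[offsets[k]:offsets[k] + nbytes], "little")
    mask = (1 << w) - 1
    return [unzigzag((packed >> (j * w)) & mask) for j in range(count)]


def decompress(widths, offsets, data, n, block=BLOCK):
    out = []
    for k in range(len(widths)):
        count = min(block, n - k * block)  # the last block may be shorter
        out.extend(decompress_block(widths, offsets, data, k, count))
    return out


values = [0, -1, 2, -3, 100, -200, 7, 0, 5]
widths, offsets, data = compress(values)
restored = decompress(widths, offsets, data, len(values))
```

Because every block has a fixed width and a known offset, blocks can be decoded independently of one another, which is the property that makes this kind of scheme amenable to parallel hardware. On an accelerator the offsets would typically come from a parallel scan rather than the sequential loop shown here.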