Lossless model compression holds tremendous promise for alleviating the memory and bandwidth bottlenecks in bit-exact Large Language Model (LLM) serving. However, existing approaches often incur substantial inference slowdowns due to fundamental design mismatches with GPU architectures: at the kernel level, the variable-length bitstreams produced by traditional entropy codecs break SIMT parallelism; at the system level, decoupled decompress-then-compute pipelines incur redundant memory traffic. We present ZipServ, a lossless compression framework co-designed for efficient LLM inference. ZipServ introduces Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE), a novel fixed-length format that enables constant-time, parallel decoding, together with a fused decompression-GEMM kernel (ZipGEMM) that decompresses weights on-the-fly directly into Tensor Core registers. This "load-compressed, compute-decompressed" design eliminates intermediate buffers and maximizes compute intensity. Experiments show that ZipServ reduces model size by up to 30%, achieves up to 2.21x kernel-level speedup over NVIDIA's cuBLAS, and accelerates end-to-end inference by 1.22x on average over vLLM. ZipServ is the first lossless compression system to deliver both storage savings and substantial acceleration for LLM inference on GPUs.
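The abstract does not specify the internals of TCA-TBE, but the key property it claims — a fixed-length, bitmap-driven format that stays lossless while allowing each element to be decoded in constant time — can be illustrated with a toy scheme. The sketch below is an assumption-laden simplification, not the actual TCA-TBE format: it splits 16-bit weights into a fixed-position low-byte plane plus a bitmap marking which high bytes equal a single common value, storing the remaining high bytes as exceptions. Because the low bytes occupy fixed positions, decode offsets are computable without scanning a variable-length bitstream.

```python
from collections import Counter

def encode(weights):
    """Toy bitmap encoding of 16-bit integers (not the real TCA-TBE).

    Exploits redundancy in the high byte: if most weights share one
    high byte, the bitmap + exception list stores it compactly while
    the low-byte plane keeps a fixed per-element position.
    """
    hi = [w >> 8 for w in weights]
    common = Counter(hi).most_common(1)[0][0]   # most frequent high byte
    bitmap = [1 if h == common else 0 for h in hi]
    low = [w & 0xFF for w in weights]           # fixed-length plane
    exceptions = [h for h, b in zip(hi, bitmap) if b == 0]
    return common, bitmap, low, exceptions

def decode(common, bitmap, low, exceptions):
    """Bit-exact reconstruction; each element needs only its bitmap bit,
    its fixed-position low byte, and (rarely) one exception lookup."""
    out, e = [], 0
    for b, lo in zip(bitmap, low):
        if b:
            out.append((common << 8) | lo)
        else:
            out.append((exceptions[e] << 8) | lo)
            e += 1
    return out
```

In a GPU setting, the per-element exception offset would be obtained from a popcount over the bitmap prefix rather than the sequential counter used here, which is what makes branch-free parallel decoding possible.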