Large language models (LLMs) now scale to trillions of parameters, driving weight storage into the terabyte regime and creating an acute mismatch with GPU memory capacity. Although lossless compression is widely effective in other domains, it remains underutilized in LLM systems. Through a comprehensive entropy study across models from 1.5B to 405B parameters and numeric formats ranging from bf16 to int4 and AWQ/SQ8, we find that LLM weights contain far less intrinsic randomness than their stored bitwidth implies: their effective entropy is 2-10x lower, indicating that up to a 10x footprint reduction is theoretically achievable without altering any weight values. Leveraging this insight, we introduce a tile-level, on-the-fly lossless decompression framework based on Asymmetric Numeral Systems (ANS) that aligns decoding with the GEMM tiling pattern of GPU inference. Our design achieves bit-rates within 0.01-0.1 bits of the Shannon limit across a wide range of LLM numerical formats, demonstrating that nearly all statistical redundancy is eliminated. Integrated into the SGLang serving framework with multi-GPU support, our approach increases the maximum batch size of Qwen-14B from 47 to 75, improving throughput by up to 1.2x. On Mixtral-176B, the feasible batch size increases from 20 to 95 (4.8x), yielding up to a 1.6x throughput improvement. Compared with the state-of-the-art lossless compression approaches NeuZip and DFloat11, our design further improves throughput by up to 11x.
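As a rough illustration of what an entropy measurement on weights involves, the sketch below estimates the empirical Shannon entropy of bf16-encoded values. It is not the study's methodology: the weights are synthetic Gaussian samples rather than real model tensors, bf16 conversion is approximated by truncating float32 to its top 16 bits, and the helper names `to_bf16_bits` and `empirical_entropy_bits` are hypothetical.

```python
import numpy as np


def to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Keep the top 16 bits of each float32 value.

    Assumption: plain truncation stands in for bf16 conversion;
    real converters usually round to nearest even.
    """
    as_u32 = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (as_u32 >> 16).astype(np.uint16)


def empirical_entropy_bits(symbols: np.ndarray) -> float:
    """Shannon entropy, in bits per symbol, of the observed symbol frequencies."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


# Synthetic stand-in for a weight tensor (assumption: zero-mean Gaussian).
rng = np.random.default_rng(0)
weights = rng.normal(loc=0.0, scale=0.02, size=1_000_000)

bits = to_bf16_bits(weights)

# bf16 layout: 1 sign bit, 8 exponent bits, 7 mantissa bits.
exponent = (bits >> 7) & 0xFF

h_word = empirical_entropy_bits(bits)          # entropy of the full 16-bit word
h_exponent = empirical_entropy_bits(exponent)  # entropy of the 8-bit exponent field

print(f"16-bit word:    {h_word:.2f} bits of entropy vs 16 stored")
print(f"8-bit exponent: {h_exponent:.2f} bits of entropy vs 8 stored")
```

The gap between the stored bitwidth and the measured entropy is the redundancy an entropy coder such as ANS can remove. On this synthetic data most of the gap comes from the exponent field, which takes only a small number of distinct values; real tensors and quantized formats may behave differently.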