Large-scale AI models, such as Large Language Models (LLMs) and Diffusion Models (DMs), have grown rapidly in size, creating significant challenges for efficient deployment on resource-constrained hardware. In this paper, we introduce Dynamic-Length Float (DFloat11), a lossless compression framework that reduces LLM and DM size by approximately 30% while preserving outputs that are bit-for-bit identical to the original model. DFloat11 is motivated by the low entropy of the BFloat16 weight representation in LLMs, which reveals substantial inefficiency in the existing storage format. By applying entropy coding, DFloat11 assigns dynamic-length encodings to weights based on frequency, achieving near information-optimal compression without any loss of precision. To enable efficient inference with dynamic-length encodings, we develop a custom GPU kernel for fast online decompression. Our design incorporates the following: (i) compact, hierarchical lookup tables (LUTs) that fit within GPU SRAM for efficient decoding; (ii) a two-phase GPU kernel that coordinates thread read/write positions using lightweight auxiliary variables; and (iii) transformer-block-level decompression to minimize latency. Experiments on Llama 3.3, Qwen 3, Mistral 3, FLUX.1, and other models confirm that DFloat11 achieves around 30% model size reduction while preserving bit-for-bit identical outputs. Compared to the alternative of offloading parts of an uncompressed model to the CPU to meet memory constraints, DFloat11 achieves 2.3--46.2x higher token-generation throughput. With a fixed GPU memory budget, DFloat11 enables 5.7--14.9x longer generation lengths than uncompressed models. Notably, our method enables lossless inference of Llama 3.1 405B, an 810GB model, on a single node equipped with 8x80GB GPUs.
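To make the entropy-coding intuition concrete, the following is a minimal Python sketch, not the paper's CUDA implementation: it extracts the 8-bit exponent field that BFloat16 shares with float32 from a synthetic weight tensor, builds a Huffman code over the exponent values, and estimates the resulting bits per weight. The weight distribution and the `huffman_code_lengths` helper are illustrative assumptions, not part of DFloat11's released code.

```python
# A minimal sketch of the idea behind DFloat11 (not the authors'
# implementation): the 8-bit exponent field of BFloat16 weights is
# highly redundant, so entropy-coding it compresses the model losslessly.
import heapq
from collections import Counter
import numpy as np

def huffman_code_lengths(freqs):
    """Return a {symbol: code_length} map for a Huffman code over freqs."""
    # Heap items: (total frequency, unique tiebreaker, symbols in subtree).
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    depth = {s: 0 for s in freqs}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:  # each merge adds one bit to every symbol below it
            depth[s] += 1
        heapq.heappush(heap, (f1 + f2, tie, s1 + s2))
        tie += 1
    return depth

# Synthetic "weights": roughly normal, as trained LLM weights tend to be.
w = np.random.normal(0.0, 0.02, size=1_000_000).astype(np.float32)
bits = w.view(np.uint32)
# BFloat16 is the top 16 bits of float32, so this is the BFloat16 exponent.
exponent = ((bits >> 23) & 0xFF).astype(np.uint8)

freqs = Counter(exponent.tolist())
lengths = huffman_code_lengths(freqs)
n = exponent.size
avg_exp_bits = sum(freqs[s] * lengths[s] for s in freqs) / n

# BFloat16 = 1 sign + 8 exponent + 7 mantissa bits; only the exponent shrinks.
compressed_bits = 1 + avg_exp_bits + 7
print(f"exponent cost: {avg_exp_bits:.2f} bits (vs. 8)")
print(f"estimated size: {compressed_bits:.2f} / 16 bits "
      f"= {compressed_bits / 16:.1%} of BFloat16")
```

Because the exponents of trained weights concentrate in a narrow range, their Huffman cost is only a few bits rather than 8, which is where the roughly 30% size reduction comes from; the sign and mantissa bits are stored as-is, so decoding reproduces every weight exactly.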