Training and serving Large Language Models (LLMs) relies heavily on parallelization and collective operations, which are frequently bottlenecked by network bandwidth. Lossless compression, for example with Huffman codes, can alleviate this bottleneck. However, Huffman codes suffer from slow, bit-sequential decoding and high hardware complexity due to deep tree traversals. Universal codes such as Exponential-Golomb codes are faster to decode but do not exploit the symbol frequency distribution. To address these limitations, this paper introduces Quad Length Codes, a hybrid approach designed to balance compression efficiency with decoding speed. The coding scheme uses 3 prefix bits to divide the 256 symbols into 8 areas. Each area uses a different code length and encodes a different number of symbols. The scheme relies on a lookup table (LUT) with only 256 entries, which significantly simplifies the hardware implementation compared to Huffman trees. The coding scheme can be adapted to different distributions. For the e4m3 data type, the scheme achieves a compressibility of 13.9%, compared with 15.9% for Huffman codes, while significantly speeding up decoding and reducing hardware complexity.
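The decoding advantage comes from the 3-bit prefix: once it is read, the codeword length is known, so no bit-by-bit tree walk is needed. The following is a minimal Python sketch of that idea under several assumptions not stated in the abstract. The per-area suffix widths in `SUFFIX_BITS` are hypothetical placeholders rather than the paper's tuned values, and symbols are assumed to be assigned to areas in order of decreasing frequency. The 256-entry LUT is assumed to map a frequency rank to a byte value. The helper names (`build_tables`, `encode`, `decode`, `compressibility`) are hypothetical, and compressibility is assumed to mean the fraction of bits saved relative to 8 bits per symbol. Bits are kept as a `'0'`/`'1'` string for readability, not speed.

```python
from collections import Counter

PREFIX_BITS = 3  # 3 prefix bits select one of 8 areas (from the abstract)
# HYPOTHETICAL suffix widths, one per area (not the paper's tuned values).
# Area a stores up to 2**SUFFIX_BITS[a] symbols using codewords of
# PREFIX_BITS + SUFFIX_BITS[a] bits.
SUFFIX_BITS = (0, 1, 2, 3, 4, 5, 6, 8)

# AREA_BASE[a] is the first frequency rank stored in area a.
AREA_BASE = [0]
for k in SUFFIX_BITS[:-1]:
    AREA_BASE.append(AREA_BASE[-1] + (1 << k))
# Together, the 8 areas must be able to hold all 256 byte values.
assert AREA_BASE[-1] + (1 << SUFFIX_BITS[-1]) >= 256


def build_tables(data: bytes):
    """Rank byte values by frequency (most frequent first).

    lut[rank] -> symbol is the 256-entry decode table (assumed layout);
    rank_of[symbol] -> rank is its inverse, used by the encoder.
    Ties are broken by byte value so the tables are deterministic.
    """
    counts = Counter(data)
    lut = sorted(range(256), key=lambda s: (-counts[s], s))
    rank_of = [0] * 256
    for rank, sym in enumerate(lut):
        rank_of[sym] = rank
    return lut, rank_of


def area_of_rank(rank: int) -> int:
    # The last area whose first rank is <= rank.
    return max(a for a in range(8) if AREA_BASE[a] <= rank)


def encode(data: bytes, rank_of) -> str:
    parts = []
    for sym in data:
        rank = rank_of[sym]
        a = area_of_rank(rank)
        parts.append(format(a, f"0{PREFIX_BITS}b"))  # area prefix
        k = SUFFIX_BITS[a]
        if k:  # a zero-width area holds one symbol and needs no suffix
            parts.append(format(rank - AREA_BASE[a], f"0{k}b"))
    return "".join(parts)


def decode(bits: str, lut) -> bytes:
    out = bytearray()
    pos = 0
    while pos < len(bits):
        a = int(bits[pos:pos + PREFIX_BITS], 2)
        pos += PREFIX_BITS
        # The prefix alone fixes the codeword length: no tree traversal.
        k = SUFFIX_BITS[a]
        index = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        out.append(lut[AREA_BASE[a] + index])  # one LUT lookup per symbol
    return bytes(out)


def compressibility(data: bytes, bits: str) -> float:
    # ASSUMED definition: fraction of bits saved relative to 8 bits/symbol.
    return 1 - len(bits) / (8 * len(data))


data = bytes([0, 0, 0, 1, 1, 2, 200, 0])
lut, rank_of = build_tables(data)
bits = encode(data, rank_of)
assert decode(bits, lut) == data
print(len(bits), compressibility(data, bits))  # 29 0.546875
```

In this toy input, the most frequent byte gets a 3-bit codeword and rarer bytes get progressively longer ones. Decoding is a fixed 3-bit read followed by a table-driven suffix read. Unlike Exponential-Golomb codes, the area assignment follows the measured frequency distribution. Unlike Huffman codes, however, the code lengths are restricted to the 8 per-area values.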