Training and serving Large Language Models (LLMs) rely heavily on parallelization and collective operations, which are frequently bottlenecked by network bandwidth. Lossless compression using, e.g., Huffman codes can alleviate this issue; however, Huffman codes suffer from slow, bit-sequential decoding and high hardware complexity due to deep tree traversals. Universal codes such as Exponential-Golomb codes are faster to decode but do not exploit symbol frequency distributions. To address these limitations, this paper introduces Dual Length Codes, a hybrid approach designed to balance compression efficiency with decoding speed. Analyzing BFloat16 tensors from the Gemma model, we observed that the 8 most frequent symbols account for approximately 50% of the cumulative probability. These 8 symbols are assigned a short 4-bit code; the remaining 248 symbols are assigned a longer 9-bit code. A single prefix bit distinguishes between the two code lengths, and a small lookup table with only 8 entries suffices for both encoding and decoding. The scheme achieves a compressibility of 18.6%, compared to 21.3% for Huffman codes, while significantly speeding up decoding and simplifying the hardware.
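The scheme described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a `0` prefix bit marks the short 4-bit code (prefix plus a 3-bit index into the 8-entry lookup table) and a `1` prefix marks the long 9-bit code (prefix plus the raw 8-bit symbol); the function names and the bit-string representation are purely illustrative.

```python
# Hypothetical sketch of a dual-length code over byte symbols.
# Assumed layout: "0" + 3-bit LUT index for the 8 most frequent symbols,
# "1" + raw 8-bit symbol for the remaining 248 symbols.

from collections import Counter

def encode(data: bytes, top8: list[int]) -> str:
    """Encode bytes as a bit string using the dual-length scheme."""
    index = {s: i for i, s in enumerate(top8)}  # 8-entry encoding LUT
    bits = []
    for b in data:
        if b in index:
            bits.append("0" + format(index[b], "03b"))  # short 4-bit code
        else:
            bits.append("1" + format(b, "08b"))         # long 9-bit code
    return "".join(bits)

def decode(bits: str, top8: list[int]) -> bytes:
    """Decode a bit string produced by encode()."""
    out, i = bytearray(), 0
    while i < len(bits):
        if bits[i] == "0":          # short code: 3-bit index into the LUT
            out.append(top8[int(bits[i + 1:i + 4], 2)])
            i += 4
        else:                       # long code: raw 8-bit symbol follows
            out.append(int(bits[i + 1:i + 9], 2))
            i += 9
    return bytes(out)

# Build the 8-entry LUT from symbol frequencies, padding if fewer than
# 8 distinct symbols occur in the sample.
data = bytes([0, 0, 0, 128, 64, 7, 255, 0])
top8 = [s for s, _ in Counter(data).most_common(8)]
top8 += [x for x in range(256) if x not in top8][: 8 - len(top8)]
assert decode(encode(data, top8), top8) == data
```

Because decoding needs only one prefix-bit test and a fixed-width read per symbol, it avoids the per-bit tree traversal of Huffman decoding, which is the speed and hardware advantage the abstract claims.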