Training and serving Large Language Models (LLMs) require partitioning data across multiple accelerators, where collective operations are frequently bottlenecked by network bandwidth. Lossless compression with Huffman codes is an effective way to alleviate this bottleneck; however, the conventional three-stage design, which requires on-the-fly frequency analysis, codebook generation, and transmission of the codebook alongside the data, introduces computational, latency, and data overheads that are prohibitive in latency-sensitive scenarios such as die-to-die communication. This paper proposes a single-stage Huffman encoder that eliminates these overheads by using fixed codebooks derived from the average probability distribution of previous data batches. Through an analysis of the Gemma 2B model, we demonstrate that tensors exhibit high statistical similarity across layers and shards. With this approach, we achieve compression within 0.5% of per-shard Huffman coding and within 1% of the ideal Shannon compressibility, enabling efficient on-the-fly compression.
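A minimal sketch of the idea, under assumed details not given in the abstract: a Huffman codebook is built once from the averaged empirical symbol distribution of earlier batches, and later batches are then encoded in a single pass with no per-batch frequency analysis and no codebook transmission. The helper names (`build_huffman_codebook`, `average_distribution`, `encode`) and the toy byte strings are hypothetical illustrations, not the paper's implementation.

```python
import heapq
from collections import Counter
from itertools import count

def build_huffman_codebook(probs):
    """Build a Huffman codebook (symbol -> bit string) from a fixed distribution."""
    tie = count()  # tie-breaker so the heap never compares dicts
    heap = [(p, next(tie), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

def average_distribution(batches):
    """Average the empirical symbol frequencies over previous batches."""
    totals = Counter()
    for batch in batches:
        counts = Counter(batch)
        n = len(batch)
        for sym, c in counts.items():
            totals[sym] += c / n
    k = len(batches)
    return {sym: v / k for sym, v in totals.items()}

def encode(data, codebook):
    """Single-stage encode: one pass over the data with a fixed codebook."""
    return "".join(codebook[sym] for sym in data)

# Hypothetical prior batches standing in for earlier tensor shards.
prior = [b"aaab", b"aabb", b"aaba"]
codebook = build_huffman_codebook(average_distribution(prior))
bits = encode(b"aaab", codebook)  # no frequency pass, no codebook shipped
```

Because the codebook is fixed ahead of time, the encoder's critical path is a single table lookup per symbol, which is what makes the scheme viable for latency-sensitive links; the compression cost is only the gap between the averaged distribution and each batch's true distribution.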