Communication has emerged as a critical bottleneck in the distributed training of large language models (LLMs). Although many approaches have been proposed to reduce communication overhead, lossless compression remains underexplored, because compression and decompression typically cost more time than the reduced communication traffic saves. We observe that the data communicated during training, including activations, gradients, and parameters, often follows a near-Gaussian distribution, a property that compression can exploit. Building on this observation, we introduce ZipCCL, a lossless compressed collective communication library for LLM training. ZipCCL combines three novel techniques: (1) theoretically grounded exponent coding that exploits the Gaussian distribution of LLM tensors to accelerate compression without expensive online statistics, (2) GPU-optimized compression and decompression kernels whose memory access patterns and pipelining are designed around a communication-aware data layout, and (3) adaptive communication strategies that dynamically switch collective operations based on workload patterns and system characteristics. Evaluated on a 64-GPU cluster with both mixture-of-experts and dense transformer models, ZipCCL reduces communication time by up to 1.35$\times$ and achieves end-to-end training speedups of up to 1.18$\times$ without any impact on model quality.
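The observation that near-Gaussian tensors are compressible can be illustrated with a small experiment. The sketch below is not ZipCCL's exponent coding; it only measures the Shannon entropy of the 8-bit exponent field of synthetic Gaussian float32 values (bfloat16 uses the same 8-bit exponent width). The function name `exponent_entropy_bits`, the sample size, and the scale 0.02 are assumptions chosen for illustration.

```python
import numpy as np


def exponent_entropy_bits(x: np.ndarray) -> float:
    """Shannon entropy (bits per value) of the 8-bit exponent field of float32 data."""
    # Reinterpret the float32 bit patterns as unsigned 32-bit integers.
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    # IEEE 754 binary32 layout: 1 sign bit, 8 exponent bits, 23 mantissa bits.
    exponents = ((bits >> 23) & 0xFF).astype(np.int64)
    counts = np.bincount(exponents, minlength=256)
    probs = counts[counts > 0] / exponents.size
    return float(-(probs * np.log2(probs)).sum())


# Synthetic stand-in for a training tensor (hypothetical scale, not from the paper).
rng = np.random.default_rng(0)
gaussian = (rng.standard_normal(1_000_000) * 0.02).astype(np.float32)

entropy = exponent_entropy_bits(gaussian)
print(f"exponent entropy: {entropy:.2f} bits out of 8")
```

On data like this, the exponent entropy comes out well below 8 bits, because a Gaussian concentrates most values in a handful of binades. An ideal entropy coder applied to the exponent alone could therefore shrink each value without losing information, which is the kind of headroom a lossless scheme can target. How much of it a real kernel recovers, and at what cost, is outside what this sketch shows.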