Communication overhead remains a critical challenge in large-scale tensor-parallel (TP) training: the dense, near-zero distributions of intermediate tensors exacerbate errors under frequent communication, and compressing these tensors introduces significant computational overhead. To this end, we propose TACO (Tensor-parallel Adaptive COmmunication compression), a robust FP8-based framework for compressing TP intermediate tensors. First, TACO combines a data-driven reshaping strategy with an Adaptive Scale-Hadamard Transform to enable high-fidelity FP8 quantization, while its Dual-Scale Quantization mechanism ensures numerical stability throughout training. Second, we design a highly fused compression operator that reduces memory traffic and kernel launch overhead, allowing compression to overlap efficiently with communication. Finally, we integrate TACO with existing state-of-the-art methods for Data and Pipeline Parallelism to build a compression-enabled 3D-parallel training framework. Detailed experiments on GPT and Qwen models demonstrate up to 1.87× end-to-end throughput improvement while maintaining near-lossless accuracy, validating the effectiveness and efficiency of TACO in large-scale training.
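To make the general idea of a Hadamard rotation followed by scaled FP8 quantization concrete, the following is a minimal NumPy sketch. It is not TACO's implementation: the abstract does not specify the Adaptive Scale-Hadamard Transform, the Dual-Scale Quantization mechanism, or the fused operator, so this sketch substitutes a plain orthonormal Hadamard rotation with a single scale per block and simulates E4M3 rounding in floating point. The function names (`hadamard`, `fake_fp8_e4m3`, `compress`, `decompress`) and the block size of 16 are assumptions chosen for illustration.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format


def hadamard(n):
    """Orthonormal n x n Hadamard matrix (Sylvester construction); n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def fake_fp8_e4m3(x):
    """Round values to the nearest E4M3-representable value (simulated in float64)."""
    x = np.clip(x, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    mag = np.abs(x)
    # Exponent of each value, floored at the minimum normal exponent (-6) so that
    # smaller values land on the subnormal grid with spacing 2**-9.
    exp = np.floor(np.log2(np.maximum(mag, 2.0 ** -6)))
    step = 2.0 ** (exp - 3)  # 3 mantissa bits give 8 steps per binade
    return np.round(x / step) * step


def compress(x, block=16):
    """Hypothetical helper: rotate each block, then quantize with one scale per block."""
    h = hadamard(block)
    blocks = x.reshape(-1, block) @ h
    amax = np.max(np.abs(blocks), axis=1, keepdims=True)
    # Map each block's largest magnitude to the FP8 maximum; all-zero blocks keep scale 1.
    scale = np.where(amax > 0, amax / FP8_E4M3_MAX, 1.0)
    return fake_fp8_e4m3(blocks / scale), scale


def decompress(q, scale, shape, block=16):
    """Hypothetical helper: undo the scaling and the rotation."""
    h = hadamard(block)
    return ((q * scale) @ h.T).reshape(shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(scale=1e-3, size=(4, 64))  # small-magnitude stand-in for a TP activation
    q, scale = compress(x)
    y = decompress(q, scale, x.shape)
    print("relative error:", np.linalg.norm(y - x) / np.linalg.norm(x))
```

Because the rotation is orthonormal, it preserves the norm of the quantization error, so the reconstruction error is governed by the FP8 rounding of the rotated, rescaled values. A real system would emit actual 8-bit payloads and fuse these steps into one kernel; this sketch only models the numerics.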