Training large language models (LLMs) on increasingly long and highly variable sequences introduces severe load imbalance in large-scale data-parallel training. Recent frameworks attempt to mitigate these issues through data reorganization or hybrid parallel strategies. However, they often overlook how computational and communication costs scale with sequence length, resulting in suboptimal performance. We identify three critical challenges: (1) varying computation-to-communication ratios across sequences of different lengths in distributed attention, (2) mismatch between static NIC-GPU affinity and dynamic parallel workloads, and (3) distinct optimal partitioning strategies required for quadratic attention versus linear components. To address these challenges, we present Zeppelin, a novel training system that integrates three key techniques: (1) a hierarchical sequence partitioning method for the attention module that reduces communication overhead and balances computation, supported by an efficient attention engine that applies divergent parallel strategies; (2) a routing layer that orchestrates inter-node transfers to fully utilize NIC bandwidth; and (3) a remapping layer that transforms sequence layouts between attention and linear modules, ensuring high computational efficiency across both. Comprehensive evaluations across diverse configurations show that Zeppelin delivers an average 2.80x speedup over state-of-the-art methods.
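To make challenge (3) concrete, consider why attention and linear modules favor different partitions: attention cost grows roughly quadratically with sequence length, while linear layers grow linearly, so a partition that balances one cost model can be badly skewed under the other. The sketch below (an illustration under these cost assumptions, not Zeppelin's actual partitioning algorithm) uses a simple greedy longest-processing-time heuristic parameterized by a cost function; the `partition` helper and its workloads are hypothetical.

```python
# Illustrative sketch, NOT Zeppelin's algorithm: greedy LPT assignment of
# variable-length sequences to workers under a pluggable cost model.
# Under a quadratic (attention) cost, one very long sequence dominates and
# ends up alone on a worker; under a linear cost, the balanced split differs.
import heapq

def partition(seq_lens, n_workers, cost):
    """Assign each sequence (longest cost first) to the lightest worker."""
    heap = [(0.0, w, []) for w in range(n_workers)]  # (load, worker, seqs)
    heapq.heapify(heap)
    for length in sorted(seq_lens, key=cost, reverse=True):
        load, w, seqs = heapq.heappop(heap)
        heapq.heappush(heap, (load + cost(length), w, seqs + [length]))
    return {w: seqs for _, w, seqs in heap}

seq_lens = [8192, 4096, 4096, 2048, 2048, 2048, 1024, 1024]
attn_parts = partition(seq_lens, 2, cost=lambda L: L * L)  # quadratic cost
lin_parts = partition(seq_lens, 2, cost=lambda L: L)       # linear cost
```

Running both cost models on the same batch shows the divergence: the quadratic partition isolates the 8192-token sequence, whereas the linear partition spreads tokens so each worker holds the same total token count, which is why a single static split cannot serve both module types.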