Data and pipeline parallelism are key strategies for scaling neural network training across distributed devices, but their high communication cost necessitates co-located computing clusters with fast interconnects, limiting their scalability. We address this communication bottleneck by introducing asynchronous updates across both parallelism axes, relaxing the co-location requirement at the expense of introducing staleness between pipeline stages and data parallel replicas. To mitigate staleness, for pipeline parallelism, we adopt a weight look-ahead approach, and for data parallelism, we introduce an asynchronous sparse averaging method equipped with an exponential moving average based correction mechanism. We provide convergence guarantees for both sparse averaging and asynchronous updates. Experiments on large-scale language models (up to 1B parameters) demonstrate that our approach matches the performance of the fully synchronous baseline, while significantly reducing communication overhead.
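To make the weight look-ahead idea concrete, here is a minimal NumPy sketch of one common formulation (SpecTrain-style momentum extrapolation): a stage predicts the weights it will see `staleness` optimizer steps in the future by extrapolating along the momentum direction. The function name and the exact prediction rule are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def lookahead_weights(w, momentum, lr, staleness):
    # Predict the weights `staleness` SGD-with-momentum steps ahead by
    # extrapolating along the current momentum direction. This is a
    # hypothetical SpecTrain-style rule, shown only to illustrate the idea.
    return w - staleness * lr * momentum

# Example: weights at 1.0, momentum 2.0, lr 0.1, predicting 3 steps ahead.
predicted = lookahead_weights(np.ones(3), 2.0 * np.ones(3), lr=0.1, staleness=3)
```

A stale pipeline stage would run its forward/backward pass on `predicted` rather than on the weights it last received, reducing the mismatch with the weights that will actually absorb its gradient.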
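The asynchronous sparse averaging with EMA correction can likewise be sketched in a toy setting: each communication round averages only a random subset of coordinates across data-parallel replicas, and an exponential moving average of past averages smooths out stale contributions. The update rule and function below are a hedged illustration under these assumptions, not the paper's exact method.

```python
import numpy as np

def sparse_avg_step(replicas, ema, frac, beta, rng):
    # One communication round (illustrative): pick a random fraction of
    # coordinates, average them across replicas, fold the average into an
    # EMA, and write the EMA back to every replica. Only `k` coordinates
    # are communicated, so the per-round traffic is a fraction of full
    # all-reduce.
    d = replicas[0].size
    k = max(1, int(frac * d))
    idx = rng.choice(d, size=k, replace=False)
    avg = np.stack([r[idx] for r in replicas]).mean(axis=0)
    ema[idx] = beta * ema[idx] + (1.0 - beta) * avg  # EMA-based correction
    for r in replicas:
        r[idx] = ema[idx]
    return idx

# Demo: with no local gradient noise, replicas contract to a consensus
# even though each round touches only 20% of the coordinates.
rng = np.random.default_rng(0)
replicas = [rng.normal(size=100) for _ in range(4)]
ema = np.mean(np.stack(replicas), axis=0).copy()
for _ in range(300):
    sparse_avg_step(replicas, ema, frac=0.2, beta=0.9, rng=rng)
spread = float(np.max(np.ptp(np.stack(replicas), axis=0)))
```

In actual training each replica would interleave local optimizer steps between rounds; the EMA term is what damps the staleness those asynchronous local updates introduce.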