Distributed machine learning (ML) training has become a necessity as models reach billions to trillions of parameters. While prior work has improved training efficiency at the application layer from the ML perspective, it often fails to address transient congestion events at the network layer, which introduce severe tail latency and training-time variability and thereby undermine the quality of service (QoS) of distributed ML training systems. Existing network optimizations treat all gradients equally and thus fail to integrate sufficient model-training insights into communication protocol design. In this paper, we present the Dynamic Bounded-Loss Protocol (DBLP), a burst-resilient, training-phase-aware, and hardware-agnostic transport protocol that incorporates model-level tolerance properties into gradient communication. By dynamically adjusting gradient loss tolerance across training phases, DBLP reduces overall training time and mitigates tail-latency collapse during transient high-loss events (i.e., microbursts). Compared to the current state-of-the-art solution (the baseline), DBLP tolerates significantly higher loss while achieving comparable test accuracy, and reduces end-to-end training time by 24.4% on average and by up to 33.9%. During microburst events, DBLP achieves up to a 5.88x speedup in single-round communication latency over the baseline, preventing burst-induced tail-latency spikes and maintaining stable training performance.
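The two headline results above are reported on different scales: training time as a percentage reduction, and microburst latency as a speedup factor. As a reading aid, the conversion below puts them on a common scale. It assumes the usual conventions, which the abstract does not spell out: a reduction r is measured relative to the baseline time, and a speedup s is the baseline time divided by the DBLP time. The converted values are arithmetic derived from the reported figures, not additional results from the paper.

```latex
% Assumed conventions (not stated in the abstract):
%   r = 1 - T_DBLP / T_baseline   (fractional time reduction)
%   s = T_baseline / T_DBLP       (speedup factor)
\[
  s = \frac{1}{1 - r}, \qquad r = 1 - \frac{1}{s}
\]
\[
  r = 0.244 \;\Rightarrow\; s = \frac{1}{0.756} \approx 1.32, \qquad
  r = 0.339 \;\Rightarrow\; s = \frac{1}{0.661} \approx 1.51
\]
\[
  s = 5.88 \;\Rightarrow\; r = 1 - \frac{1}{5.88} \approx 0.830
\]
```

Under these assumptions, the end-to-end gains correspond to roughly 1.32x on average and 1.51x at best, while the 5.88x microburst speedup corresponds to about an 83% reduction in single-round communication latency.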