Distributed Machine Learning (DML) systems are used to accelerate model training in data centers (DCs) and on edge nodes. The Parameter Server (PS) communication architecture is commonly employed, but its many-to-one "incast" traffic pattern causes severe long-tail latency, which degrades training throughput. To address this challenge, we design the \textbf{L}oss-tolerant \textbf{T}ransmission \textbf{P}rotocol (LTP), which permits partial loss of gradients during synchronization to avoid unnecessary retransmissions, shortening the synchronization time of each iteration. LTP achieves loss-tolerant transmission through \textit{out-of-order transmission} and \textit{out-of-order acknowledgements (ACKs)}. It employs \textit{Early Close} to adjust the loss-tolerant threshold according to network conditions and \textit{Bubble Filling} to correct the received data so that training accuracy is maintained. LTP is implemented in C++ and integrated into PyTorch. Evaluations on a testbed of 8 worker nodes and one PS node show that LTP improves DML training throughput by up to 30x compared to traditional TCP congestion control algorithms, with no loss in final accuracy.
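As a rough illustration of the receiver-side idea, the sketch below shows how a gradient could be reassembled from chunks that arrive out of order, closing the transfer once a tolerated fraction has arrived and filling the gaps left by lost chunks. It is a minimal sketch under stated assumptions, not LTP's actual implementation: the function names `can_close_early` and `fill_bubbles`, the fixed `threshold`, and the zero fill value are all hypothetical. In particular, the abstract says Early Close adapts the threshold to network conditions, which this sketch does not model.

```python
from typing import Dict, List


def can_close_early(received: int, total: int, threshold: float) -> bool:
    """Return True once the received fraction reaches the tolerated threshold.

    Assumption: a fixed threshold; adapting it to network conditions
    is not modeled here.
    """
    return total > 0 and received / total >= threshold


def fill_bubbles(
    chunks: Dict[int, List[float]],
    total_chunks: int,
    chunk_len: int,
    fill_value: float = 0.0,
) -> List[float]:
    """Rebuild the flat gradient in sequence order, filling lost chunks.

    Assumption: missing chunks ("bubbles") are replaced by a constant
    fill value (zero by default); the real correction policy may differ.
    """
    flat: List[float] = []
    for seq in range(total_chunks):
        chunk = chunks.get(seq)
        if chunk is None:
            # Lost chunk: insert placeholder values so positions stay aligned.
            flat.extend([fill_value] * chunk_len)
        else:
            flat.extend(chunk)
    return flat


# Chunks keyed by sequence number, in arrival order; chunk 1 was lost.
arrived = {2: [0.5, 0.6], 0: [0.1, 0.2], 3: [0.7, 0.8]}
if can_close_early(len(arrived), total=4, threshold=0.75):
    gradient = fill_bubbles(arrived, total_chunks=4, chunk_len=2)
```

Keying chunks by sequence number keeps reassembly independent of arrival order, which is what allows the receiver to stop waiting for a lost chunk instead of stalling on a retransmission.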