Traffic from distributed training of machine learning (ML) models makes up a large and growing fraction of the traffic mix in enterprise data centers. While work on distributed ML abounds, the network traffic it generates has received little attention. Using measurements on a testbed network, we investigate the characteristics of the traffic generated by training the ResNet-50 neural network, with an emphasis on its short-term burstiness. To study burstiness, we propose metrics that quantify it at different time scales. Our analysis reveals that distributed ML traffic exhibits a very high degree of burstiness on short time scales, exceeding a 60:1 peak-to-mean ratio on time intervals as long as 5 ms. We observe that training software orchestrates transmissions in such a way that burst transmissions from different sources within the same application do not result in congestion and packet losses. An extrapolation of the measurement data to multiple applications underscores the challenges that distributed ML traffic poses for congestion and flow control algorithms.
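As an illustration of what a peak-to-mean ratio at a given time scale can look like in practice, the following is a minimal sketch, not the metric definition used in this work, which the abstract does not spell out. It assumes a packet trace given as (timestamp in microseconds, size in bytes) pairs, divides the trace into fixed-length windows, and compares the busiest window with the average window. The function name `peak_to_mean_ratio` and the trace format are hypothetical.

```python
from collections import Counter


def peak_to_mean_ratio(packets, window_us):
    """Peak-to-mean ratio of bytes per fixed-length window.

    packets:   list of (timestamp_us, size_bytes) pairs (assumed trace format)
    window_us: window length in microseconds (the time scale of interest)
    """
    if not packets:
        raise ValueError("empty trace")
    if window_us <= 0:
        raise ValueError("window_us must be positive")

    start = min(t for t, _ in packets)
    end = max(t for t, _ in packets)

    # Sum the bytes that fall into each window, indexed from the first packet.
    bytes_per_window = Counter()
    for t, size in packets:
        bytes_per_window[(t - start) // window_us] += size

    # Count empty windows too: they lower the mean but not the peak.
    n_windows = (end - start) // window_us + 1
    total = sum(bytes_per_window.values())
    if total == 0:
        raise ValueError("trace carries no bytes")
    peak = max(bytes_per_window.values())

    # peak / (total / n_windows), rearranged to limit rounding error.
    return peak * n_windows / total


# Synthetic trace (made up for illustration): two 1 ms bursts, 50 ms apart.
trace = [(t, 1000) for t in range(0, 1000, 100)]
trace += [(t, 1000) for t in range(50_000, 51_000, 100)]

for window_us in (1_000, 10_000, 100_000):
    print(window_us, peak_to_mean_ratio(trace, window_us))
```

On this synthetic trace the ratio shrinks as the window grows, which is why a burstiness figure only has meaning together with the time scale at which it was measured.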