We present MLTCP, a technique that augments today's congestion control algorithms to accelerate DNN training jobs in shared GPU clusters. MLTCP enables the communication phases of jobs that compete for network bandwidth to interleave with each other, thereby utilizing the network efficiently. At the heart of MLTCP lies a simple principle: DNN training flows should scale their congestion window size based on the number of bytes sent in each training iteration. We show that integrating this principle into today's congestion control protocols is straightforward: by adding 30-60 lines of code to Reno, CUBIC, or DCQCN, MLTCP stabilizes flows of different jobs into an interleaved state within a few training iterations, regardless of the number of competing flows or the start time of each flow. Our experiments with popular DNN training jobs demonstrate that enabling MLTCP reduces the average and 99th percentile training iteration times by up to 2x and 4x, respectively.
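To make the core principle concrete, the following is a minimal sketch of a Reno-style congestion-avoidance update whose additive increase is scaled by how much of the current training iteration's data has already been sent. The abstract does not specify the scaling function, so the linear form `slope * ratio + intercept`, the constants, the assumption that flows further into their iteration become more aggressive, and all names (`IterationScaledReno`, `total_bytes`, `start_iteration`) are illustrative assumptions rather than MLTCP's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class IterationScaledReno:
    """Toy Reno-like window with an iteration-aware additive increase.

    Hypothetical sketch: the scaling function and constants are assumptions,
    not taken from the MLTCP paper.
    """

    total_bytes: int        # bytes the job sends per training iteration (assumed known)
    cwnd: float = 10.0      # congestion window, in segments
    slope: float = 1.0      # assumed tuning constant
    intercept: float = 0.5  # assumed tuning constant
    bytes_sent: int = 0     # bytes acknowledged so far in this iteration

    def scale(self) -> float:
        # Fraction of this iteration's data already delivered, capped at 1.
        ratio = min(self.bytes_sent / self.total_bytes, 1.0)
        return self.slope * ratio + self.intercept

    def on_ack(self, acked_bytes: int) -> None:
        self.bytes_sent += acked_bytes
        # Plain Reno adds 1/cwnd per ACK in congestion avoidance;
        # here that increment is multiplied by the iteration-based scale.
        self.cwnd += self.scale() / self.cwnd

    def on_loss(self) -> None:
        # Multiplicative decrease is left unchanged from Reno.
        self.cwnd = max(self.cwnd / 2.0, 1.0)

    def start_iteration(self) -> None:
        # A new training iteration resets the progress counter.
        self.bytes_sent = 0


# Two flows at different points in their iterations grow at different rates.
early = IterationScaledReno(total_bytes=1_000_000)
late = IterationScaledReno(total_bytes=1_000_000, bytes_sent=800_000)
for _ in range(100):
    early.on_ack(1460)
    late.on_ack(1460)
print(f"early cwnd={early.cwnd:.2f}, late cwnd={late.cwnd:.2f}")
```

Under these assumptions, the flow that is further into its iteration ramps up faster and tends to finish its communication phase first, which is one plausible way competing jobs could drift apart into an interleaved schedule.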