Large-scale distributed training in production datacenters is a challenging workload bottlenecked by network communication. In response, both major industry players (e.g., Ultra Ethernet Consortium) and parts of academia have, surprisingly and almost unanimously, agreed that packet spraying is necessary to improve the performance of large-scale distributed training workloads. In this paper, we challenge this prevailing belief and pose the question: How closely can a single-path transport approach an optimal multipath transport? We demonstrate that single-path transport (from a NIC's perspective) is sufficient and can perform nearly as well as an ideal multipath transport with packet spraying, particularly for distributed training in leaf-spine topologies. Our assertion rests on four key observations about workloads driven by collective communication patterns: (i) flows within a collective start almost simultaneously, (ii) flow sizes are nearly equal, (iii) the completion time of a collective matters more than individual flow completion times, and (iv) flows can be split upon arrival. We prove analytically that single-path transport with minimal flow splitting (at the application layer) is equivalent to an ideal multipath transport with packet spraying in terms of maximum congestion. Our preliminary evaluations support our claims. This paper suggests an alternative agenda for developing next-generation transport protocols tailored for large-scale distributed training.
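The equivalence claim can be illustrated with a toy model. The sketch below is not the paper's algorithm. It assumes a single leaf whose flows share `num_paths` equal-capacity uplinks, takes "maximum congestion" to mean the largest load on any uplink, and uses a simple wrap-around splitting rule chosen for illustration. The function names `spray_max_load` and `wraparound_split` are hypothetical.

```python
from fractions import Fraction


def spray_max_load(flow_sizes, num_paths):
    """Ideal packet spraying: every flow is spread evenly over all paths,
    so each path carries total / num_paths."""
    return Fraction(sum(flow_sizes), num_paths)


def wraparound_split(flow_sizes, num_paths):
    """Assumed splitting rule (not from the paper): fill paths one at a time
    up to the ideal per-path load, and split a flow only when it crosses a
    path boundary. Each resulting piece stays on a single path.

    Returns (loads, pieces), where pieces[i] lists (path, amount) for flow i.
    """
    target = Fraction(sum(flow_sizes), num_paths)
    loads = [Fraction(0)] * num_paths
    pieces = []
    path = 0
    for size in flow_sizes:
        remaining = Fraction(size)
        parts = []
        while remaining > 0:
            room = target - loads[path]
            if room == 0:
                path += 1  # current path is full, move to the next one
                continue
            amount = min(room, remaining)
            loads[path] += amount
            parts.append((path, amount))
            remaining -= amount
        pieces.append(parts)
    return loads, pieces


if __name__ == "__main__":
    flows = [1] * 8  # eight equal-size flows, matching observation (ii)
    loads, pieces = wraparound_split(flows, num_paths=3)
    split_flows = sum(1 for parts in pieces if len(parts) > 1)
    print("max load with spraying: ", spray_max_load(flows, 3))
    print("max load with splitting:", max(loads))
    print("flows that were split:  ", split_flows)
```

In this toy setting, eight unit-size flows over three paths reach the same maximum load (8/3) under both schemes, and only two of the eight flows are split. Without any splitting, some path would have to carry at least three whole flows.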