In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be interleaved to maximize communication/computation overlap. In this scenario, outstanding operations such as Allgather and Reduce-Scatter compete for injection bandwidth and create pipeline bubbles. To address this problem, we propose a novel bandwidth-optimal Allgather collective algorithm that leverages hardware multicast. We use multicast to build a constant-time reliable Broadcast protocol, which serves as the building block for constructing an optimal Allgather schedule. Our Allgather algorithm achieves a 2x traffic reduction on a 188-node testbed. To free the host from running the protocol, we offload it to a SmartNIC: we extract the parallelism in our Allgather algorithm and map it onto a SmartNIC specialized for hiding the cost of data movement. We show that our SmartNIC-offloaded collective progress engine can scale to next-generation 1.6 Tbit/s links.