As modern DNN models grow ever larger, collective communication among accelerators (allreduce, etc.) is emerging as a significant performance bottleneck. Designing efficient communication schedules is challenging given today's highly diverse and heterogeneous network fabrics. In this paper, we present ForestColl, a tool that generates efficient schedules for any network topology. ForestColl constructs broadcast/aggregation spanning trees as the communication schedule, achieving theoretically minimum network congestion. Its schedule generation runs in strongly polynomial time and is highly scalable. ForestColl supports any network fabric, including both switching fabrics and direct connections, as well as any network graph structure. We evaluated ForestColl on multi-cluster AMD MI250 and NVIDIA A100 platforms. ForestColl's schedules achieved up to 52\% higher performance than the vendors' own optimized communication libraries, RCCL and NCCL. ForestColl also outperforms other state-of-the-art schedule generation techniques, producing schedules that are up to 61\% more efficient while generating them orders of magnitude faster.