The all-to-all collective communication primitive is widely used in machine learning (ML) and high-performance computing (HPC) workloads, and optimizing its performance is of interest to both communities. All-to-all is a particularly challenging workload that can severely strain the underlying interconnect bandwidth at scale. This paper takes a holistic approach to optimizing the performance of all-to-all collective communication on supercomputer-scale direct-connect interconnects. We address several algorithmic and practical challenges in developing efficient, bandwidth-optimal all-to-all schedules for any topology and in lowering those schedules to various runtimes and interconnect technologies. We also propose a novel topology that delivers near-optimal all-to-all performance.