The all-to-all collective communication primitive is widely used in machine learning (ML) and high-performance computing (HPC) workloads, and optimizing its performance is of interest to both the ML and HPC communities. All-to-all is a particularly challenging workload that can severely strain the underlying interconnect bandwidth at scale, mainly because the number of messages that must be serviced simultaneously grows quadratically with the number of nodes and the messages themselves are large. This paper takes a holistic approach to optimizing the performance of all-to-all collective communication on supercomputer-scale direct-connect interconnects. We address several algorithmic and practical challenges in developing efficient and bandwidth-optimal all-to-all schedules for any topology, lowering the schedules to various backends and fabrics that may or may not expose additional forwarding bandwidth, establishing an upper bound on all-to-all throughput, and exploring novel topologies that deliver near-optimal all-to-all performance.