The surge of artificial intelligence, particularly large language models, has driven the rapid development of large-scale machine learning clusters. Executing distributed models on these clusters is often constrained by communication overhead, making efficient utilization of available network resources crucial. As a result, the routing algorithms employed for collective communication (i.e., collective algorithms) play a pivotal role in determining overall performance. Unfortunately, existing collective communication libraries for distributed machine learning are limited to a fixed set of basic collective algorithms. This limitation hinders communication optimization, especially in modern clusters with heterogeneous and asymmetric topologies. Furthermore, manually designing collective algorithms for every combination of network topology and collective pattern requires heavy engineering and validation effort. To address these challenges, this paper presents TACOS, an autonomous synthesizer that automatically generates topology-aware collective algorithms tailored to specific collective patterns and network topologies. TACOS is highly flexible: it synthesizes an All-Reduce algorithm for a heterogeneous 128-NPU system in just 1.08 seconds while achieving up to a 4.27x performance improvement over state-of-the-art synthesizers. TACOS also scales well, with polynomial synthesis time, in contrast to NP-hard approaches that only scale to systems with tens of NPUs; it synthesizes an algorithm for 40K NPUs in just 2.52 hours.
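To make the term "collective algorithm" concrete, the following is a minimal Python sketch of the textbook ring All-Reduce: a fixed schedule of chunk exchanges among NPUs of the kind that existing libraries ship as a hardcoded option. This is a generic illustration only, not TACOS's synthesized output or any library's actual API; TACOS instead generates such schedules automatically for heterogeneous, asymmetric topologies.

```python
def ring_all_reduce(buffers):
    """Simulate a ring All-Reduce over `buffers`, a list of equal-length
    lists (one per NPU). For simplicity, each buffer is split into exactly
    n chunks, one per NPU. Returns the fully reduced buffers."""
    n = len(buffers)                       # number of NPUs in the ring
    data = [list(b) for b in buffers]      # data[i][c]: NPU i's copy of chunk c

    # Phase 1: reduce-scatter. After n-1 steps, NPU i holds the complete
    # sum of chunk (i + 1) % n. Sends are snapshotted first so all NPUs
    # forward their pre-step values, mimicking simultaneous link transfers.
    for step in range(n - 1):
        sends = [((i + 1) % n, (i - step) % n, data[i][(i - step) % n])
                 for i in range(n)]
        for dst, c, val in sends:
            data[dst][c] += val

    # Phase 2: all-gather. Each fully reduced chunk circulates around the
    # ring until every NPU holds every complete chunk.
    for step in range(n - 1):
        sends = [((i + 1) % n, (i + 1 - step) % n, data[i][(i + 1 - step) % n])
                 for i in range(n)]
        for dst, c, val in sends:
            data[dst][c] = val

    return data


# Example: 4 NPUs, each holding a 4-chunk gradient; after All-Reduce
# every NPU holds the elementwise sum.
out = ring_all_reduce([[1, 2, 3, 4], [10, 20, 30, 40],
                       [100, 200, 300, 400], [1000, 2000, 3000, 4000]])
assert all(row == [1111, 2222, 3333, 4444] for row in out)
```

The ring schedule is bandwidth-optimal on a homogeneous ring (each NPU sends 2(n-1) chunks), but it is exactly the kind of fixed, topology-agnostic algorithm whose limitations on heterogeneous clusters motivate synthesis.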