The surge of artificial intelligence, particularly large language models, has driven the rapid development of large-scale machine learning training clusters. Collective communications within these clusters tend to be heavily bandwidth-bound, necessitating techniques that make optimal use of the available network bandwidth. This puts the routing algorithm for the collective at the forefront of determining performance. Unfortunately, the communication libraries used in distributed machine learning today are limited to a fixed set of routing algorithms. This constrains collective performance in next-generation training clusters, which employ intricate, heterogeneous, asymmetric, large-scale topologies. Further, irregular topologies arising from runtime phenomena such as device failures compound the challenge. To this end, this paper introduces TACOS, an automated synthesizer that generates topology-aware collective algorithms for common distributed machine learning collectives across arbitrary input network topologies. TACOS synthesized an All-Reduce algorithm for a heterogeneous 512-NPU system in just 6.09 minutes while achieving a performance improvement of up to 4.27x over state-of-the-art prior work. TACOS exhibits high scalability, with synthesis time scaling quadratically with the number of NPUs. In contrast to prior works' NP-hard approaches, TACOS synthesis for 40K NPUs completes in 2.52 hours.