The surge of artificial intelligence, particularly large language models, has driven the rapid development of large-scale machine learning clusters. Executing distributed models on these clusters is often constrained by communication overhead, making efficient utilization of available network resources crucial. As a result, the routing algorithms employed for collective communication (i.e., collective algorithms) play a pivotal role in determining overall performance. Unfortunately, existing collective communication libraries for distributed machine learning are limited to a fixed set of basic collective algorithms. This limitation hinders communication optimization, especially in modern clusters with heterogeneous and asymmetric topologies. Furthermore, manually designing collective algorithms for every combination of network topology and collective pattern requires heavy engineering and validation effort. To address these challenges, this paper presents TACOS, an autonomous synthesizer that automatically generates topology-aware collective algorithms tailored to specific collective patterns and network topologies. TACOS is highly flexible: it synthesizes an All-Reduce algorithm for a heterogeneous 128-NPU system in just 1.08 seconds while achieving up to a 4.27x performance improvement over state-of-the-art synthesizers. TACOS also scales well, with polynomial synthesis time, in contrast to NP-hard approaches that only scale to systems with tens of NPUs; it synthesizes an algorithm for 40K NPUs in just 2.52 hours.
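To make the term "collective algorithm" concrete, the following is a minimal Python sketch of the textbook ring All-Reduce: a fixed schedule of chunk exchanges among NPUs of the kind that existing libraries ship as a hardcoded option. This is a generic illustration only, not TACOS's synthesized output or any library's actual API; TACOS instead generates such schedules automatically for heterogeneous, asymmetric topologies.

```python
def ring_all_reduce(buffers):
    """Simulate a ring All-Reduce over `buffers`, a list of equal-length
    lists (one per NPU). For simplicity, each buffer is split into exactly
    n chunks, one per NPU. Returns the fully reduced buffers."""
    n = len(buffers)                       # number of NPUs in the ring
    data = [list(b) for b in buffers]      # data[i][c]: NPU i's copy of chunk c

    # Phase 1: reduce-scatter. After n-1 steps, NPU i holds the complete
    # sum of chunk (i + 1) % n. Sends are snapshotted first so all NPUs
    # forward their pre-step values, mimicking simultaneous link transfers.
    for step in range(n - 1):
        sends = [((i + 1) % n, (i - step) % n, data[i][(i - step) % n])
                 for i in range(n)]
        for dst, c, val in sends:
            data[dst][c] += val

    # Phase 2: all-gather. Each fully reduced chunk circulates around the
    # ring until every NPU holds every complete chunk.
    for step in range(n - 1):
        sends = [((i + 1) % n, (i + 1 - step) % n, data[i][(i + 1 - step) % n])
                 for i in range(n)]
        for dst, c, val in sends:
            data[dst][c] = val

    return data


# Example: 4 NPUs, each holding a 4-chunk gradient; after All-Reduce
# every NPU holds the elementwise sum.
out = ring_all_reduce([[1, 2, 3, 4], [10, 20, 30, 40],
                       [100, 200, 300, 400], [1000, 2000, 3000, 4000]])
assert all(row == [1111, 2222, 3333, 4444] for row in out)
```

The ring schedule is bandwidth-optimal on a homogeneous ring (each NPU sends 2(n-1) chunks), but it is exactly the kind of fixed, topology-agnostic algorithm whose limitations on heterogeneous clusters motivate synthesis.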