TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning

The surge of artificial intelligence, specifically large language models, has led to a rapid advent towards the development of large-scale machine learning training clusters. Collective communications within these clusters tend to be heavily bandwidth-bound, necessitating techniques to optimally utilize the available network bandwidth. This puts the routing algorithm for the collective at the forefront of determining the performance. Unfortunately, communication libraries used in distributed machine learning today are limited by a fixed set of routing algorithms. This constraints collective performance within the domain of next-generation training clusters that employ intricate, heterogeneous, and asymmetric, large-scale topologies. Further, the emergence of irregular topologies attributed to runtime phenomena such as device failures serves to compound the complexity of the challenge. To this end, this paper introduces TACOS, an automated synthesizer that generates topology-aware collective algorithms for common distributed machine learning collectives across arbitrary input network topologies. TACOS was able to synthesize All-Reduce algorithm for a heterogeneous 512-NPU system in just 6.09 minutes while achieving performance improvement up to 4.27x over state-of-the-art prior work. TACOS exhibits high scalability, with synthesis time scaling quadratically with the number of NPUs. In contrast to prior works' NP-hard approaches, TACOS with 40K NPUs completes in 2.52 hours.

翻译：人工智能的快速发展，特别是大语言模型的涌现，推动了大规模机器学习训练集群的加速建设。集群内的集合通信严重受限于带宽，亟需技术手段以最优方式利用可用网络带宽，这使得集合通信的路由算法成为决定性能的关键因素。然而，当前分布式机器学习中使用的通信库仅支持固定的路由算法集合，这限制了在采用复杂、异构、非对称大规模拓扑的下一代训练集群中的集合通信性能。此外，设备故障等运行时现象导致的非规则拓扑进一步加剧了这一挑战。为此，本文提出TACOS——一种自动综合器，可为任意输入网络拓扑上的常见分布式机器学习集合操作生成拓扑感知的集合算法。TACOS在异构512 NPU系统上仅需6.09分钟即可综合出All-Reduce算法，相比现有最优工作实现高达4.27倍的性能提升。TACOS展现出高可扩展性，其综合时间与NPU数量呈二次方增长。与现有工作中NP难方法形成对比，TACOS在40K NPU规模下仅需2.52小时即可完成综合。

相关内容

Machine Learning

关注 2251

机器学习（Machine Learning）是一个研究计算学习方法的国际论坛。该杂志发表文章，报告广泛的学习方法应用于各种学习问题的实质性结果。该杂志的特色论文描述研究的问题和方法，应用研究和研究方法的问题。有关学习问题或方法的论文通过实证研究、理论分析或与心理现象的比较提供了坚实的支持。应用论文展示了如何应用学习方法来解决重要的应用问题。研究方法论文改进了机器学习的研究方法。所有的论文都以其他研究人员可以验证或复制的方式描述了支持证据。论文还详细说明了学习的组成部分，并讨论了关于知识表示和性能任务的假设。官网地址：http://dblp.uni-trier.de/db/journals/ml/

【CHI2020-微软】解释可解释性:理解数据科学家使用机器学习的可解释性工具，Interpreting Interpretability: Understanding Data Scientists’Use of Interpretability Tools for Machine Learning

专知会员服务

55+阅读 · 2020年3月8日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日