The rapid growth in machine learning model size has pushed model execution onto distributed clusters at very large scale. Many works have tried to optimize how collective algorithms are produced and how collective communications are run, since collective communication is a bottleneck in distributed machine learning. However, each work uses its own collective algorithm representation, which makes it difficult to co-optimize collective communication with the rest of the workload. The lack of a standardized collective algorithm representation has also hindered interoperability between collective algorithm producers and consumers. In addition, every pair of tools that produces and consumes collective algorithms needs its own tool-specific conversions and modifications, which adds engineering effort. In this position paper, we propose a standardized workflow built on a common collective algorithm representation. Upstream producers and downstream consumers converge on a common representation format based on the Chakra Execution Trace, a commonly used graph-based representation of distributed machine learning workloads. Such a common representation lets us view collective communications at the same level as workload operations, decouples producer and consumer tools, enhances interoperability, and relieves users of the burden of focusing on downstream implementations. We provide a proof of concept of this standardized workflow by simulating collective algorithms generated by the MSCCLang domain-specific language in the ASTRA-sim distributed machine learning simulator under various network configurations.
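To make the idea of a graph-based collective algorithm representation concrete, the following is a minimal sketch under stated assumptions. It is not the Chakra Execution Trace schema, the MSCCLang output format, or the ASTRA-sim input format: the `Node` fields and the functions `ring_all_gather`, `critical_path_length`, and `chunks_held` are hypothetical names chosen for illustration. A toy producer emits a ring all-gather as send and receive nodes with explicit dependencies, and two toy consumers read only that graph, with no knowledge of how it was produced.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # Python 3.9+


@dataclass(frozen=True)
class Node:
    """One operation in a hypothetical algorithm graph (not Chakra's real schema)."""
    node_id: int
    rank: int          # rank that executes this operation
    op: str            # "send" or "recv"
    peer: int          # the other rank involved
    chunk: int         # index of the data chunk being moved
    deps: tuple = ()   # node_ids that must complete first


def ring_all_gather(num_ranks):
    """Toy producer: emit a ring all-gather as a flat list of nodes."""
    nodes = []
    last_recv = {}  # rank -> node_id of the recv it completed in the previous step
    next_id = 0
    for step in range(num_ranks - 1):
        new_last_recv = {}
        for rank in range(num_ranks):
            dst = (rank + 1) % num_ranks
            chunk = (rank - step) % num_ranks
            # A rank can forward a chunk only after receiving it in the previous step.
            send_deps = (last_recv[rank],) if rank in last_recv else ()
            send = Node(next_id, rank, "send", dst, chunk, send_deps)
            recv = Node(next_id + 1, dst, "recv", rank, chunk, (send.node_id,))
            nodes += [send, recv]
            new_last_recv[dst] = recv.node_id
            next_id += 2
        last_recv = new_last_recv
    return nodes


def critical_path_length(nodes):
    """Toy consumer: length (in nodes) of the longest dependency chain."""
    graph = {n.node_id: set(n.deps) for n in nodes}
    depth = {}
    for node_id in TopologicalSorter(graph).static_order():
        depth[node_id] = 1 + max((depth[d] for d in graph[node_id]), default=0)
    return max(depth.values())


def chunks_held(nodes, num_ranks):
    """Toy consumer: which chunks each rank holds once every recv has completed."""
    held = {r: {r} for r in range(num_ranks)}
    for n in nodes:
        if n.op == "recv":
            held[n.rank].add(n.chunk)
    return held


if __name__ == "__main__":
    algo = ring_all_gather(4)
    print(len(algo), critical_path_length(algo), chunks_held(algo, 4))
```

For four ranks, this sketch emits 24 nodes with a critical path of 6, and every rank ends up holding all four chunks. The point of the design is that both consumers depend only on the shared node format, so a different producer could be swapped in without changing them.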