As deep learning models and input data scale at an unprecedented rate, distributed training platforms have become necessary both to fit the model and to increase training throughput. Emerging distributed training systems are actively adopting state-of-the-art techniques such as wafer-scale nodes, multi-dimensional network topologies, disaggregated memory systems, and parallelization strategies. The result is a complex SW/HW co-design stack for distributed training, which necessitates a modeling and simulation infrastructure for design-space exploration. In this paper, we extend the open-source ASTRA-sim infrastructure with the capabilities to model state-of-the-art and emerging distributed training models and platforms. More specifically, (i) we enable ASTRA-sim to support arbitrary model parallelization strategies via a graph-based training-loop implementation, (ii) we implement a parameterizable multi-dimensional heterogeneous topology generation infrastructure with analytical performance estimates, which enables simulating target systems at scale, and (iii) we enhance the memory system modeling to support accurate modeling of in-network collective communication and disaggregated memory systems. With these capabilities, we run comprehensive case studies targeting emerging distributed models and platforms. This infrastructure lets system designers swiftly traverse the complex co-design stack and provides meaningful insights for designing and deploying distributed training platforms at scale.