Training and deploying large machine learning (ML) models is time-consuming and requires significant distributed computing infrastructure. Based on real-world large model training on datacenter-scale infrastructure, we show that 14-32% of all GPU hours are spent on communication that is not overlapped with computation. To minimize this non-overlapped communication latency, we develop an agile performance modeling framework to guide parallelization and hardware-software co-design strategies. Using a suite of real-world large ML models on state-of-the-art GPU training hardware, we demonstrate 2.24x and 5.27x throughput improvement potential for pre-training and inference scenarios, respectively.