Simulation offers unique value for both enumeration and extrapolation, and is becoming increasingly important for managing massive machine learning (ML) clusters and large-scale distributed training jobs. In this paper, we build Echo to tackle three key challenges in large-scale training simulation: (1) tracing the runtime training workloads at each device in an ex-situ fashion, so that a single device suffices to obtain the actual execution graphs of 1K-GPU training; (2) accurately estimating collective communication time without the high overheads of discrete-event network simulation; and (3) accounting for the interference-induced computation slowdown caused by overlapping communication and computation kernels on the same device. Echo delivers an average training step time error of 8% -- roughly 3x lower than state-of-the-art simulators -- for GPT-175B on a 96-GPU H800 cluster with 3D parallelism on Megatron-LM, in under 2 minutes.