We present a theoretical model of distributed training, and use it to analyze how far dense and sparse training runs can be scaled. Under our baseline assumptions, given a three month training duration, data movement bottlenecks begin to significantly lower hardware utilization for training runs exceeding about $10^{28}$ FLOP, two orders of magnitude above the largest training run to date, \textbf{suggesting the arrival of fundamental barriers to scaling in three years} given recent rates of growth. A training run exceeding about $10^{31}$ FLOP is infeasible even at low utilization. However, more aggressive batch size scaling and/or shorter and fatter model shapes, if achievable, have the potential to permit much larger training runs.
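The three-year figure follows from a simple growth-rate calculation; as a rough sketch (the $\sim 10^{28}$ FLOP threshold and the two-orders-of-magnitude gap are taken from the statement above, while the per-year growth factor is merely the value implied by closing that gap in three years):
\[
  \underbrace{\sim 10^{26}\ \text{FLOP}}_{\text{largest run to date}} \;\times\; g^{3} \;=\; 10^{28}\ \text{FLOP}
  \quad\Longrightarrow\quad
  g \;=\; 10^{2/3} \;\approx\; 4.6\times \text{ per year},
\]
which is in line with recent rates of growth in frontier training compute.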