Large model training beyond tens of thousands of GPUs is uncharted territory. At such scales, disruptions to the training process are not a matter of if but of when: they form a stochastic process that degrades training productivity. Dynamic runtime variation will become increasingly frequent as training scales up and as GPUs are operated in increasingly power-limited and thermally stressed environments. At the 64,000+ GPU scale, we already observe 9% GPU time variability for frontier foundation model training. Motivated by our analysis and the large design space around performance variability, we present PRISM, a performance modeling framework that captures the stochastic nature of large-scale distributed training. The core of PRISM is a statistical method that quantifies probabilistic guarantees on training time. Using PRISM, we explore the design and optimization space of distributed training, enabling principled, variability-aware decisions that improve performance and system efficiency at scale.