A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM

Shaoke Xi, ChonLam Lao, Boyi Jia, Jiaqi Gao, Zhipeng Zhang, Jiamin Cao, Brian Sutioso, Erci Xu, Minlan Yu, Kui Ren, Yong Li, Zhengping Qian, Ennan Zhai, Jingren Zhou

From arXiv; 13 pages body, 21 pages total.

Large language model (LLM) training today runs on clusters spanning thousands of GPUs. While this scale enables rapid model advances, it makes developing, debugging, and performance-tuning the training framework complex and costly. Engineers often need to reproduce production behaviors to diagnose failures or evaluate optimizations, which demands frequent and even exclusive access to production-scale clusters. Such access is increasingly hard to obtain because the majority of GPUs are already committed to production workloads. The existing alternatives fall short: simulation relies on complex performance models that are difficult to maintain, and downscaled experiments often fail to capture scale-dependent behaviors. We present PrismLLM, which decouples large-scale execution from the need to access large clusters, enabling engineers to run and observe ranks of interest under faithful large-scale behavior using only a few GPUs. PrismLLM first constructs a high-fidelity execution graph via a slicing-based approach that captures the computation, communication, and dependencies of the target scale. It then performs hybrid emulation, in which selected ranks execute the original program while the remaining ranks are replayed as virtual participants. Experiments on large-scale LLM training workloads show that PrismLLM accurately reproduces performance and memory behavior, with only 0.58% average error in iteration time and less than 0.01% error in peak GPU memory usage. PrismLLM can emulate clusters of up to 8192 GPUs using fewer than 1% of the physical GPUs required by the original deployment.
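To make the hybrid-emulation idea and the GPU budget concrete, the following is a toy sketch, not PrismLLM's implementation. The names `Op`, `emulate`, `run_real`, and `max_physical_gpus` are hypothetical, and the sketch assumes each operation carries a pre-recorded duration and ignores cross-rank dependencies, which the real execution graph would capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One operation of one rank (hypothetical record, not PrismLLM's format)."""
    rank: int
    name: str
    recorded_ms: float  # assumed duration stored for replay


def emulate(ops, real_ranks, run_real):
    """Accumulate per-rank time: real ranks execute, the rest are replayed.

    `run_real` is a caller-supplied function that runs an op and returns
    its measured duration in milliseconds. Dependencies between ranks are
    deliberately ignored in this toy version.
    """
    elapsed = {}
    for op in ops:
        if op.rank in real_ranks:
            dt = run_real(op)      # selected rank: execute the original work
        else:
            dt = op.recorded_ms    # virtual rank: replay the recorded duration
        elapsed[op.rank] = elapsed.get(op.rank, 0.0) + dt
    return elapsed


def max_physical_gpus(target_gpus, percent=1):
    """Largest whole GPU count strictly below `percent`% of `target_gpus`."""
    return (target_gpus * percent - 1) // 100


if __name__ == "__main__":
    ops = [
        Op(0, "forward", 2.0),
        Op(1, "forward", 2.0),
        Op(0, "all_reduce", 1.0),
        Op(1, "all_reduce", 1.0),
    ]
    # Rank 0 is "real" (here a stand-in that reports 5 ms per op); rank 1 is replayed.
    print(emulate(ops, real_ranks={0}, run_real=lambda op: 5.0))
    print(max_physical_gpus(8192))
```

Under these assumptions, "fewer than 1%" of an 8192-GPU deployment means at most 81 physical GPUs, since 1% of 8192 is 81.92.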
