Aligning future system design with the ever-increasing compute needs of large language models (LLMs) is undoubtedly an important problem. Here, we propose a general performance modeling methodology and workload analysis of distributed LLM training and inference through an analytical framework that accurately accounts for compute, the memory sub-system, the network, and various parallelization strategies (model, data, pipeline, and sequence parallelism). We validate our performance predictions against published data from the literature and relevant industry vendors (e.g., NVIDIA). For distributed training, we investigate the memory footprint of LLMs under different activation-recomputation methods, dissect the key factors behind the massive performance gain from A100 to B200 (a $\sim$35x speed-up, closely following NVIDIA's scaling trend), and further run a design space exploration across technology nodes (12 nm to 1 nm) to study the impact of logic, memory, and network scaling on performance. For inference, we analyze the compute- versus memory-boundedness of different operations at the matrix-multiply level for different GPU systems and further explore the impact of DRAM memory technology scaling on inference latency. Using our modeling framework, we reveal how the performance bottlenecks of LLM training and inference evolve with technology scaling, thereby providing insights for designing future systems for LLM training and inference.
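As a minimal illustration of the compute- versus memory-bound analysis at the matrix-multiply level (the hardware figures below are representative published A100-80GB numbers, roughly 312 TFLOP/s dense FP16 and 2 TB/s HBM bandwidth, used here only as an example and not results from our framework): an FP16 GEMM of shape $(M,K)\times(K,N)$ performs $2MKN$ FLOPs while moving about $2(MK + KN + MN)$ bytes, so its arithmetic intensity and the machine balance to compare it against are
\[
I_{\mathrm{GEMM}} \;=\; \frac{2MKN}{2(MK + KN + MN)} \;=\; \frac{MKN}{MK + KN + MN}\ \ \text{FLOP/byte},
\qquad
\frac{\pi_{\mathrm{FP16}}}{\beta_{\mathrm{HBM}}} \;\approx\; \frac{312\ \text{TFLOP/s}}{2.0\ \text{TB/s}} \;\approx\; 156\ \text{FLOP/byte}.
\]
Under these assumed figures, large training-time GEMMs with $M,N,K \gg 156$ land well above the machine balance and are compute bound, whereas a batch-1 decode step ($M=1$, giving $I \approx 1$ FLOP/byte) falls far below it and is memory-bandwidth bound; this is the kind of operation-level distinction the inference analysis draws for different GPU systems.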