A Scalable Recipe on SuperMUC-NG Phase 2: Efficient Large-Scale Training of Language Models

Large Language Models (LLMs) continue to demonstrate superior performance with increasing scale, yet training models with billions to trillions of parameters requires staggering computational resources; for example, a one-trillion-parameter GPT-style model requires an estimated 120 million exaflops. This challenge necessitates efficient distributed training strategies on cutting-edge High-Performance Computing (HPC) infrastructure. In this work, we explore the SuperMUC-NG Phase 2 (SMNG-P2) system at the Leibniz Supercomputing Centre (LRZ) in Garching, Germany, which is equipped with Intel Data Center GPU Max 1550 accelerators, to extract the necessary computational power. We enable and investigate a comprehensive recipe of parallel training techniques, including tensor parallelism, pipeline parallelism, and sharded data parallelism, which are essential for training LLMs of up to 175 billion parameters on SMNG-P2. Through empirical assessment and extensive hyperparameter tuning, we analyze the complex interplay among these techniques and determine their impact on GPU computational efficiency. We identify an optimized combined strategy that yields high throughput and enables the efficient training of LLMs of varying sizes. Specifically, for the 175B model, we achieve a per-tile throughput of 10% of the theoretical peak per-tile bf16 FLOPs using an out-of-the-box, publicly available software stack built from standard distributions without further modification. This approach ensures broad accessibility, as any user of the SMNG-P2 system can replicate our methodology without the need for porting or specialized software engineering. Furthermore, we achieve a weak scaling efficiency of 93% and a strong scaling efficiency of 82% on 128 nodes. This scalable recipe provides a crucial blueprint for efficiently utilizing advanced exascale systems for next-generation foundational model development.
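
The quantities named in the abstract can be made concrete with a minimal Python sketch. It is not the authors' tooling: the function names, the divisibility constraints, and the numbers in the usage lines are illustrative assumptions, and the efficiency formulas are the common textbook definitions, which the abstract itself does not spell out. The sketch enumerates the ways a fixed number of workers (for example GPU tiles) can be split into tensor-, pipeline-, and data-parallel degrees, and computes weak and strong scaling efficiency from measured step times.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    tensor: int    # tensor-parallel degree (splits the matrices inside a layer)
    pipeline: int  # pipeline-parallel degree (splits the layer stack into stages)
    data: int      # (sharded) data-parallel degree (replicas over the batch)


def enumerate_layouts(world_size, num_layers, max_tensor):
    """List all degree triples whose product equals world_size.

    Assumed constraints (illustrative, not taken from the paper):
    the tensor degree is capped at max_tensor, and every pipeline
    stage must hold a whole number of layers.
    """
    layouts = []
    for tp in range(1, max_tensor + 1):
        if world_size % tp:
            continue
        rest = world_size // tp
        for pp in range(1, rest + 1):
            if rest % pp or num_layers % pp:
                continue
            layouts.append(ParallelLayout(tp, pp, rest // pp))
    return layouts


def strong_scaling_efficiency(base_time, base_workers, time, workers):
    """Fixed total problem size: ideal time shrinks in proportion to workers."""
    return (base_time * base_workers) / (time * workers)


def weak_scaling_efficiency(base_time, time):
    """Fixed work per worker: ideal time stays constant as workers grow."""
    return base_time / time


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to exercise the functions.
    for layout in enumerate_layouts(world_size=16, num_layers=8, max_tensor=4):
        print(layout)
    print(strong_scaling_efficiency(100.0, 8, 15.0, 64))
    print(weak_scaling_efficiency(10.0, 12.5))
```

A search over such layouts is one plausible starting point for the kind of tuning the abstract describes; which layout performs best depends on interconnect bandwidth, memory per tile, and batch size, none of which this sketch models.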

