Dramatic increases in the capabilities of neural network models in recent years are driven by scaling model size, training data, and corresponding computational resources. To develop the exceedingly large networks required in modern applications, such as large language models (LLMs), model training is distributed across tens of thousands of hardware accelerators (e.g. GPUs), requiring orchestration of computation and communication across large computing clusters. In this work, we demonstrate that careful consideration of hardware configuration and parallelization strategy is critical for effective (i.e. compute- and cost-efficient) scaling of model size, training data, and total computation. We conduct an extensive empirical study of the performance of large-scale LLM training workloads across model size, hardware configurations, and distributed parallelization strategies. We demonstrate that: (1) beyond certain scales, the overhead incurred by some distributed communication strategies means that parallelization strategies previously thought to be sub-optimal in fact become preferable; and (2) scaling the total number of accelerators for large model training quickly yields diminishing returns even when hardware and parallelization strategies are properly optimized, implying poor marginal performance per additional unit of power or GPU-hour.
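To make finding (2) concrete, the following is a minimal back-of-the-envelope sketch, not the paper's model: per-step compute is assumed to divide perfectly across accelerators, while a ring-all-reduce-style communication term contributes a roughly constant bandwidth cost plus a latency cost that grows with group size. The function `step_time_ms` and every constant below are hypothetical placeholders, not measurements from this study.

```python
# Toy scaling model: perfect compute scaling plus a ring all-reduce
# communication term. All constants are hypothetical placeholders.
def step_time_ms(n_gpus: int,
                 total_compute_ms: float = 1000.0,  # one full step on a single GPU
                 allreduce_bw_ms: float = 20.0,     # bandwidth-bound all-reduce term
                 latency_per_hop_ms: float = 0.001  # per-hop latency, ~2(n-1) hops in a ring
                 ) -> float:
    """Per-iteration wall time: compute shrinks as 1/n while collective
    communication adds a near-constant bandwidth cost plus a latency
    cost that grows linearly with the number of participants."""
    compute = total_compute_ms / n_gpus
    comm = allreduce_bw_ms + latency_per_hop_ms * 2 * (n_gpus - 1)
    return compute + comm

if __name__ == "__main__":
    base = step_time_ms(1)
    for n in (8, 64, 512, 4096, 32768):
        speedup = base / step_time_ms(n)
        # Efficiency = speedup per GPU, a proxy for marginal return per accelerator.
        print(f"{n:6d} GPUs: speedup {speedup:7.1f}x, efficiency {speedup / n:7.2%}")
```

Under these toy constants, per-GPU efficiency collapses from roughly 88% at 8 GPUs to well under 1% at tens of thousands, illustrating why marginal performance per additional GPU-hour degrades even before any parallelization strategy is mis-tuned.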