Dramatic increases in the capabilities of neural network models in recent years are driven by scaling model size, training data, and corresponding computational resources. To develop the exceedingly large networks required in modern applications, such as large language models (LLMs), model training is distributed across tens of thousands of hardware accelerators (e.g. GPUs), requiring orchestration of computation and communication across large computing clusters. In this work, we demonstrate that careful consideration of hardware configuration and parallelization strategy is critical for effective (i.e. compute- and cost-efficient) scaling of model size, training data, and total computation. We conduct an extensive empirical study of the performance of large-scale LLM training workloads across model size, hardware configurations, and distributed parallelization strategies. We demonstrate that: (1) beyond certain scales, the overhead incurred by some distributed communication strategies means that parallelization strategies previously thought to be sub-optimal in fact become preferable; and (2) scaling the total number of accelerators for large model training quickly yields diminishing returns even when hardware and parallelization strategies are properly optimized, implying poor marginal performance per additional unit of power or GPU-hour.
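To make finding (2) concrete, the following is a minimal back-of-the-envelope sketch, not the paper's model: per-step compute is assumed to divide perfectly across accelerators, while a ring-all-reduce-style communication term contributes a roughly constant bandwidth cost plus a latency cost that grows with group size. The function `step_time_ms` and every constant below are hypothetical placeholders, not measurements from this study.

```python
# Toy scaling model: perfect compute scaling plus a ring all-reduce
# communication term. All constants are hypothetical placeholders.
def step_time_ms(n_gpus: int,
                 total_compute_ms: float = 1000.0,  # one full step on a single GPU
                 allreduce_bw_ms: float = 20.0,     # bandwidth-bound all-reduce term
                 latency_per_hop_ms: float = 0.001  # per-hop latency, ~2(n-1) hops in a ring
                 ) -> float:
    """Per-iteration wall time: compute shrinks as 1/n while collective
    communication adds a near-constant bandwidth cost plus a latency
    cost that grows linearly with the number of participants."""
    compute = total_compute_ms / n_gpus
    comm = allreduce_bw_ms + latency_per_hop_ms * 2 * (n_gpus - 1)
    return compute + comm

if __name__ == "__main__":
    base = step_time_ms(1)
    for n in (8, 64, 512, 4096, 32768):
        speedup = base / step_time_ms(n)
        # Efficiency = speedup per GPU, a proxy for marginal return per accelerator.
        print(f"{n:6d} GPUs: speedup {speedup:7.1f}x, efficiency {speedup / n:7.2%}")
```

Under these toy constants, per-GPU efficiency collapses from roughly 88% at 8 GPUs to well under 1% at tens of thousands, illustrating why marginal performance per additional GPU-hour degrades even before any parallelization strategy is mis-tuned.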