An appropriate choice of batch size is crucial in large-scale model training, yet it involves an intrinsic and inevitable dilemma: large-batch training improves training efficiency in terms of memory utilization, but generalization performance often deteriorates because larger batches reduce gradient noise. Despite this dilemma, the common practice of choosing batch sizes in language model training prioritizes training efficiency -- employing either constant large batch sizes with data parallelism or batch size warmup schedules. However, such schedules remain heuristic and often fail to adapt to training dynamics, raising the challenge of designing adaptive batch size schedules. Given the abundance of available datasets and the data-hungry nature of language models, data parallelism has become an indispensable distributed training paradigm, enabling the use of larger batch sizes for gradient computation. However, vanilla data parallelism requires replicating model parameters, gradients, and optimizer states on each worker, which prohibits training larger models with billions of parameters; to optimize memory usage, more advanced parallelism strategies must be employed. In this work, we propose general-purpose, theoretically principled adaptive batch size schedules compatible with both data parallelism and model parallelism. We develop a practical implementation with PyTorch Fully Sharded Data Parallel, facilitating the pretraining of language models at different scales. We empirically demonstrate that our proposed approaches outperform constant batch sizes and heuristic batch size warmup schedules in the pretraining of models in the Llama family, with a particular focus on smaller models of up to 3 billion parameters. We also establish theoretical convergence guarantees for such adaptive batch size schedules with Adam for general smooth nonconvex objectives.
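The abstract does not spell out the adaptive schedule itself. As an illustration of the kind of gradient-noise-driven criterion such schedules typically build on, here is a minimal sketch of the classical "norm test" rule: grow the batch when the sampled gradient variance is large relative to the squared norm of the mean gradient. All function names, thresholds, and growth factors below are hypothetical, not the paper's method.

```python
import numpy as np

def norm_test_batch_size(per_sample_grads, current_bs,
                         theta=0.5, growth=2, max_bs=4096):
    """Sketch of a norm-test adaptive batch size rule (hypothetical API).

    per_sample_grads: (b, d) array of per-sample gradients from the current batch.
    The batch size is increased when the estimated gradient variance dominates
    the squared norm of the mean gradient, i.e. when the gradient estimate is
    too noisy at the current batch size.
    """
    b = per_sample_grads.shape[0]
    g_mean = per_sample_grads.mean(axis=0)
    # Unbiased estimate of the per-sample gradient variance (trace of covariance).
    var = per_sample_grads.var(axis=0, ddof=1).sum()
    # Norm test: keep the batch size while Var/b <= theta^2 * ||g_mean||^2.
    if var / b > theta**2 * np.dot(g_mean, g_mean):
        return min(growth * current_bs, max_bs)
    return current_bs
```

In a data-parallel setting, the variance statistics would be accumulated across workers (e.g. via an all-reduce) before applying the test, so that every replica agrees on the next batch size.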