An appropriate choice of batch size is crucial in large-scale model training, yet it involves an intrinsic and inevitable dilemma: large-batch training improves training efficiency in terms of memory utilization, but generalization performance often deteriorates because larger batches reduce gradient noise. Despite this dilemma, the common practice of choosing batch sizes in language model training prioritizes training efficiency -- employing either constant large batch sizes with data parallelism or batch size warmup schedules. However, such schedules remain heuristic and often fail to adapt to training dynamics, raising the challenge of designing adaptive batch size schedules. Given the abundance of available datasets and the data-hungry nature of language models, data parallelism has become an indispensable distributed training paradigm, enabling the use of larger batch sizes for gradient computation. However, vanilla data parallelism requires replicating model parameters, gradients, and optimizer states on each worker, which prohibits training larger models with billions of parameters; to optimize memory usage, more advanced parallelism strategies must be employed. In this work, we propose general-purpose, theoretically principled adaptive batch size schedules compatible with both data parallelism and model parallelism. We develop a practical implementation with PyTorch Fully Sharded Data Parallel, facilitating the pretraining of language models at different scales. We empirically demonstrate that our proposed approaches outperform constant batch sizes and heuristic batch size warmup schedules in the pretraining of models in the Llama family, with a particular focus on smaller models of up to 3 billion parameters. We also establish theoretical convergence guarantees for such adaptive batch size schedules with Adam for general smooth nonconvex objectives.
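The abstract does not spell out the adaptive schedule itself. As an illustration of the kind of gradient-noise-driven criterion such schedules typically build on, here is a minimal sketch of the classical "norm test" rule: grow the batch when the sampled gradient variance is large relative to the squared norm of the mean gradient. All function names, thresholds, and growth factors below are hypothetical, not the paper's method.

```python
import numpy as np

def norm_test_batch_size(per_sample_grads, current_bs,
                         theta=0.5, growth=2, max_bs=4096):
    """Sketch of a norm-test adaptive batch size rule (hypothetical API).

    per_sample_grads: (b, d) array of per-sample gradients from the current batch.
    The batch size is increased when the estimated gradient variance dominates
    the squared norm of the mean gradient, i.e. when the gradient estimate is
    too noisy at the current batch size.
    """
    b = per_sample_grads.shape[0]
    g_mean = per_sample_grads.mean(axis=0)
    # Unbiased estimate of the per-sample gradient variance (trace of covariance).
    var = per_sample_grads.var(axis=0, ddof=1).sum()
    # Norm test: keep the batch size while Var/b <= theta^2 * ||g_mean||^2.
    if var / b > theta**2 * np.dot(g_mean, g_mean):
        return min(growth * current_bs, max_bs)
    return current_bs
```

In a data-parallel setting, the variance statistics would be accumulated across workers (e.g. via an all-reduce) before applying the test, so that every replica agrees on the next batch size.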