Training with larger mini-batches improves the performance and convergence rate of machine learning models. However, training with large mini-batches becomes prohibitive for Large Language Models (LLMs) with billions of parameters, due to their large GPU memory requirements. To address this problem, we propose finding small mini-batches that simulate the dynamics of training with larger mini-batches. Specifically, we formulate selecting smaller mini-batches of examples that closely capture the gradients of large mini-batches as a submodular maximization problem. Nevertheless, the very high dimensionality of the gradients makes this problem challenging to solve. To address this, we leverage ideas from zeroth-order optimization and neural network pruning to compute lower-dimensional gradient estimates that enable finding high-quality subsets with a limited amount of memory. We prove the superior convergence rate of training on the small mini-batches found by our method and empirically demonstrate its effectiveness. Our method reduces the memory requirement by 2x and speeds up training by 1.3x, as we confirm by fine-tuning Phi-2 on MathInstruct. Our method can be easily stacked with LoRA and other memory-efficient methods to further reduce the memory requirements of training LLMs.
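The subset-selection step described above can be illustrated with a greedy algorithm for a monotone submodular objective. The sketch below uses a facility-location function over pairwise gradient similarities as a stand-in surrogate (a common choice for gradient matching; the paper's exact formulation and the low-dimensional gradient estimation are not reproduced here). All function and variable names are hypothetical.

```python
import numpy as np

def greedy_gradient_subset(grads, k):
    """Greedily select k examples whose gradients best cover the mini-batch.

    Maximizes the facility-location objective F(S) = sum_i max_{j in S} sim(i, j)
    via the standard greedy algorithm (1 - 1/e approximation guarantee).
    `grads` stands in for low-dimensional per-example gradient estimates,
    not full LLM gradients (hypothetical simplification).
    """
    # cosine similarity between per-example gradient estimates
    g = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    sims = g @ g.T
    n = sims.shape[0]
    cover = np.zeros(n)  # best similarity of each example to the selected set
    selected = []
    for _ in range(k):
        # marginal gain of each candidate: increase in total coverage
        gains = np.maximum(sims, cover[:, None]).sum(axis=0) - cover.sum()
        j = int(np.argmax(gains))
        selected.append(j)
        cover = np.maximum(cover, sims[:, j])
    return selected
```

With two clusters of near-duplicate gradients, the greedy rule picks one representative per cluster, since adding a second example from an already-covered cluster yields almost no marginal gain.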