Mini-batch stochastic gradient descent (SGD) and its variants approximate the gradient of the objective function using a small number of training examples; this number is the batch size. Small batch sizes require little computation per model update but can yield high-variance gradient estimates, which complicates optimization. Conversely, large batches require more computation per update but can yield more precise gradient estimates. This work presents a method that adapts the batch size to the model's training loss. For various function classes, we show that our method requires the same order of model updates as gradient descent while requiring the same order of gradient computations as SGD. In its exact form, the method evaluates the model's loss on the entire dataset at every model update; approximating the training loss greatly reduces this computation. We provide experiments illustrating that our methods require fewer model updates without increasing the total amount of computation.
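To make the idea concrete, the following is a minimal sketch of loss-adaptive batch sizing on a least-squares problem, not the method as specified in this work. The abstract does not state the adaptation rule, so the rule used here (batch size inversely proportional to the current training loss, clipped to the dataset size) is an assumption, as are the names `adaptive_batch_sgd`, `scale`, and `lr`. The sketch evaluates the loss on the full dataset at every update and does not implement the loss approximation mentioned above.

```python
import numpy as np


def adaptive_batch_sgd(X, y, lr=0.05, scale=1.0, n_updates=200, seed=0):
    """Mini-batch SGD on least squares with a loss-dependent batch size.

    Assumed rule (illustrative only): batch_size = ceil(scale / loss),
    clipped to the range [1, n]. A high loss gives small, cheap batches;
    a low loss gives large, low-variance batches.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    history = []  # (full training loss, batch size) recorded before each update

    for _ in range(n_updates):
        # Exact training loss on the entire dataset (the costly step).
        residual = X @ w - y
        loss = 0.5 * np.mean(residual ** 2)

        # Adapt the batch size to the current loss.
        batch_size = int(np.clip(np.ceil(scale / max(loss, 1e-12)), 1, n))

        # Mini-batch gradient estimate from a uniform sample without replacement.
        idx = rng.choice(n, size=batch_size, replace=False)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size

        w -= lr * grad
        history.append((loss, batch_size))

    return w, history


if __name__ == "__main__":
    # Synthetic regression data, used only for demonstration.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = X @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=500)

    w, history = adaptive_batch_sgd(X, y)
    print("first (loss, batch size):", history[0])
    print("last  (loss, batch size):", history[-1])
```

Under this assumed rule, early updates use very small batches while the loss is large, and the batch grows as the loss approaches its noise floor, which mirrors the trade-off between per-update computation and gradient variance described above.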