The performance of mini-batch stochastic gradient descent (SGD) strongly depends on how the batch size and learning rate are set to minimize the empirical loss when training deep neural networks. In this paper, we present theoretical analyses of mini-batch SGD with four schedulers: (i) constant batch size with a decaying learning rate, (ii) increasing batch size with a decaying learning rate, (iii) increasing batch size with an increasing learning rate, and (iv) increasing batch size with a warm-up decaying learning rate. We show that mini-batch SGD using scheduler (i) does not always minimize the expectation of the full gradient norm of the empirical loss, whereas it does using any of schedulers (ii), (iii), and (iv). Furthermore, schedulers (iii) and (iv) accelerate mini-batch SGD. The paper also provides numerical results supporting these analyses, showing that using scheduler (iii) or (iv) minimizes the full gradient norm of the empirical loss faster than using scheduler (i) or (ii).
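The four scheduler families described above can be sketched as simple per-epoch rules mapping an epoch index to a (batch size, learning rate) pair. This is an illustrative sketch only: the specific growth factors, decay rates, intervals, and warm-up length below are assumptions for demonstration, not the constants used in the paper's analyses.

```python
def constant_batch_decay_lr(epoch, b0=32, lr0=0.1, decay=0.9):
    """(i) Constant batch size, exponentially decaying learning rate."""
    return b0, lr0 * decay ** epoch

def grow_batch_decay_lr(epoch, b0=32, lr0=0.1, growth=2, interval=10, decay=0.9):
    """(ii) Batch size multiplied by `growth` every `interval` epochs;
    learning rate decays on the same schedule."""
    k = epoch // interval
    return b0 * growth ** k, lr0 * decay ** k

def grow_batch_grow_lr(epoch, b0=32, lr0=0.01, growth=2, lr_growth=1.5, interval=10):
    """(iii) Both batch size and learning rate increase over time."""
    k = epoch // interval
    return b0 * growth ** k, lr0 * lr_growth ** k

def grow_batch_warmup_decay_lr(epoch, b0=32, lr0=0.1, growth=2, interval=10,
                               warmup=5, decay=0.9):
    """(iv) Increasing batch size; learning rate warms up linearly for
    `warmup` epochs, then decays exponentially."""
    b = b0 * growth ** (epoch // interval)
    if epoch < warmup:
        lr = lr0 * (epoch + 1) / warmup      # linear warm-up phase
    else:
        lr = lr0 * decay ** (epoch - warmup)  # decay phase
    return b, lr
```

In practice, such rules would be queried once per epoch to configure the data loader's batch size and the optimizer's learning rate before each pass over the training set.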