The performance of stochastic gradient descent (SGD), the simplest first-order optimizer for training deep neural networks, depends not only on the learning rate but also on the batch size. Both affect the number of iterations and the stochastic first-order oracle (SFO) complexity needed for training. In particular, previous numerical results indicated that, for SGD with a constant learning rate, the number of iterations needed for training decreases as the batch size increases, while the SFO complexity is minimized at a critical batch size and increases once the batch size exceeds it. Here, we study the relationship between batch size and the iteration and SFO complexities needed for nonconvex optimization in deep learning with SGD using constant or decaying learning rates, and we show that SGD using the critical batch size minimizes the SFO complexity. We also provide numerical comparisons of SGD with existing first-order optimizers and show the usefulness of SGD using a critical batch size. Moreover, we show that the measured critical batch sizes are close to the sizes estimated from our theoretical results.
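The trade-off described above can be illustrated with a small numerical sketch. It assumes that SFO complexity is counted as the number of iterations multiplied by the batch size, and it uses a hypothetical iteration model K(b) = c1 * b / (eps^2 * b - c2). That functional form, the constants `c1`, `c2`, `eps`, and the candidate batch sizes are assumptions chosen for illustration only; none of them are stated in the abstract.

```python
# Illustrative sketch only: the iteration model and the constants c1, c2, eps
# are assumptions made for this example, not results quoted from the abstract.

def iterations_needed(b, c1=1.0, c2=1.0, eps=0.25):
    """Hypothetical iteration count K(b) = c1*b / (eps^2*b - c2).

    Only defined for b > c2 / eps^2; below that, this model says the
    target precision eps cannot be reached.
    """
    denom = eps**2 * b - c2
    if denom <= 0:
        raise ValueError("batch size too small for this model")
    return c1 * b / denom


def sfo_complexity(b, **kw):
    """SFO complexity: total stochastic gradient evaluations, K(b) * b."""
    return iterations_needed(b, **kw) * b


def critical_batch_size(candidates, **kw):
    """Return the candidate batch size with the smallest SFO complexity."""
    return min(candidates, key=lambda b: sfo_complexity(b, **kw))


# Hypothetical grid of batch sizes, all above the threshold c2/eps^2 = 16.
batch_sizes = [20, 24, 32, 48, 64, 128, 256]

for b in batch_sizes:
    print(b, round(iterations_needed(b), 2), round(sfo_complexity(b), 2))

print("critical batch size:", critical_batch_size(batch_sizes))
```

In this assumed model, the iteration count falls monotonically as the batch size grows, while the SFO complexity first falls and then rises. Setting the derivative of K(b) * b to zero gives b = 2 * c2 / eps^2, which is 32 for the constants used here and matches the grid minimum.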