Stochastic gradient descent (SGD) is the simplest optimizer for training deep neural networks. While SGD can use various learning rates, such as constant or diminishing rates, previous numerical results showed that SGD performs better than other deep learning optimizers when it uses learning rates given by line search methods. In this paper, we perform a convergence analysis of SGD with a learning rate given by an Armijo line search for nonconvex optimization. The analysis indicates that the upper bound of the expectation of the squared norm of the full gradient becomes small when the number of steps and the batch size are large. Next, we show that, for SGD with the Armijo-line-search learning rate, the number of steps needed for nonconvex optimization is a monotone decreasing convex function of the batch size; that is, the number of steps needed decreases as the batch size increases. Furthermore, we show that the stochastic first-order oracle (SFO) complexity, which is the stochastic gradient computation cost, is a convex function of the batch size; that is, there exists a critical batch size that minimizes the SFO complexity. Finally, we provide numerical results that support our theoretical results. The numerical results indicate that the number of steps needed for training deep neural networks decreases as the batch size increases and that there exist critical batch sizes that can be estimated from the theoretical results.
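To make the Armijo-line-search learning rate concrete, the following is a minimal sketch of mini-batch SGD with a backtracking Armijo step. It is an illustration under stated assumptions, not the implementation analyzed in the paper: the function names `armijo_step` and `sgd_armijo`, the constants `eta0`, `c`, and `shrink`, and the least-squares objective (a convex toy problem chosen only for brevity) are all hypothetical choices. The step size is shrunk until the mini-batch loss satisfies the sufficient-decrease condition f_B(x - eta g) <= f_B(x) - c eta ||g||^2, where g is the mini-batch gradient.

```python
import numpy as np


def armijo_step(loss, grad, x, eta0=1.0, c=0.1, shrink=0.5, max_backtracks=50):
    """Backtrack from eta0 until the Armijo sufficient-decrease condition holds.

    eta0, c, and shrink are illustrative defaults, not values from the paper.
    Returns the accepted step size and the gradient at x.
    """
    g = grad(x)
    f0 = loss(x)
    g_sq = float(g @ g)
    eta = eta0
    for _ in range(max_backtracks):
        # Armijo condition: loss(x - eta*g) <= loss(x) - c * eta * ||g||^2
        if loss(x - eta * g) <= f0 - c * eta * g_sq:
            break
        eta *= shrink
    return eta, g


def sgd_armijo(A, y, batch_size, steps, seed=0):
    """Mini-batch SGD on a toy least-squares loss with an Armijo step size."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(steps):
        # Sample a mini-batch without replacement.
        idx = rng.choice(n, size=batch_size, replace=False)
        Ab, yb = A[idx], y[idx]

        def batch_loss(w):
            return 0.5 * np.mean((Ab @ w - yb) ** 2)

        def batch_grad(w):
            return Ab.T @ (Ab @ w - yb) / batch_size

        # The line search uses the same mini-batch for loss and gradient.
        eta, g = armijo_step(batch_loss, batch_grad, x)
        x = x - eta * g
    return x
```

Under these assumptions, calling `sgd_armijo(A, y, batch_size=b, steps=K)` for several values of `b` and recording the smallest `K` that reaches a target gradient norm gives one way to observe empirically how the number of steps, and the product `b * K` as a rough proxy for SFO complexity, vary with the batch size.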