Despite the omnipresent use of stochastic gradient descent (SGD) optimization methods in the training of deep neural networks (DNNs), it remains a fundamental open problem, in essentially all practically relevant scenarios, to provide a rigorous theoretical explanation for the success (and the limitations) of SGD optimization methods in deep learning. In particular, it remains an open question to prove or disprove convergence of the true risk of SGD optimization methods to the optimal true risk value in the training of DNNs. In one of the main results of this work we reveal, for a general class of activations, loss functions, random initializations, and SGD optimization methods (including, for example, standard SGD, momentum SGD, Nesterov accelerated SGD, Adagrad, RMSprop, Adadelta, Adam, Adamax, Nadam, Nadamax, and AMSGrad), that in the training of an arbitrary fully-connected feedforward DNN the true risk of the considered optimizer does not converge in probability to the optimal true risk value. Nonetheless, the true risk of the considered SGD optimization method may well converge to a strictly suboptimal true risk value.
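To make the convergence notion in this statement precise, the following is a minimal formalization under assumed notation (the symbols $\Theta_n$ for the $n$-th iterate of the optimizer and $\mathcal{L}$ for the true risk are illustrative choices, not taken from the work itself): the failure of convergence in probability to the optimal true risk value means that there exists $\varepsilon \in (0,\infty)$ such that
\[
\limsup_{n \to \infty} \mathbb{P}\Bigl( \bigl| \mathcal{L}(\Theta_n) - \inf_{\theta} \mathcal{L}(\theta) \bigr| \geq \varepsilon \Bigr) > 0 .
\]
Equivalently, convergence in probability would require the above probability to tend to zero for every $\varepsilon \in (0,\infty)$, and the result asserts that this fails for at least one $\varepsilon$.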