In many numerical simulations, stochastic gradient descent (SGD) type optimization methods perform very effectively in the training of deep neural networks (DNNs), but to this day it remains an open research problem to provide a mathematical convergence analysis that rigorously explains the success of SGD type optimization methods in the training of DNNs. In this work we study SGD type optimization methods in the training of fully-connected feedforward DNNs with rectified linear unit (ReLU) activation. We first establish general regularity properties for the risk functions and their generalized gradient functions appearing in the training of such DNNs. Thereafter, we investigate the plain vanilla SGD optimization method in the training of such DNNs under the assumption that the target function under consideration is a constant function. Specifically, we prove that, if the learning rates (the step sizes of the SGD optimization method) are sufficiently small but not $L^1$-summable and the target function is a constant function, then the expectation of the risk of the considered SGD process converges to zero as the number of SGD steps increases to infinity.
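To make the setting concrete, the following is a minimal numerical sketch, not the construction analyzed in this work. It runs plain mini-batch SGD on a ReLU network with a constant target and reports the risk before and after training. Everything specific in it is an assumption chosen for illustration: a single hidden layer (the work treats deep networks), inputs uniform on $[0,1]$, the squared-error risk approximated on a fixed grid, the initialization scales, and the step-size schedule $\gamma_n = \gamma_0/\sqrt{n+1}$, which is small and has a divergent sum. The names `realization`, `risk`, `sgd_step`, and `train` are hypothetical helpers.

```python
import math
import random


def realization(params, x):
    """Output of a one-hidden-layer ReLU network at a scalar input x."""
    w, b, v, c = params
    return c + sum(vj * max(wj * x + bj, 0.0) for wj, bj, vj in zip(w, b, v))


def risk(params, xi, grid):
    """Mean squared error against the constant target xi on a fixed grid."""
    return sum((realization(params, x) - xi) ** 2 for x in grid) / len(grid)


def sgd_step(params, batch, xi, gamma):
    """One plain SGD step using a generalized gradient of the empirical risk."""
    w, b, v, c = params
    h = len(w)
    g_w, g_b, g_v, g_c = [0.0] * h, [0.0] * h, [0.0] * h, 0.0
    for x in batch:
        pre = [wj * x + bj for wj, bj in zip(w, b)]
        err = c + sum(vj * max(p, 0.0) for vj, p in zip(v, pre)) - xi
        g_c += 2.0 * err
        for j, p in enumerate(pre):
            # Assumed convention: the ReLU derivative is 1 where the unit
            # is active and 0 elsewhere, including at the kink.
            if p > 0.0:
                g_v[j] += 2.0 * err * p
                g_b[j] += 2.0 * err * v[j]
                g_w[j] += 2.0 * err * v[j] * x
    scale = gamma / len(batch)  # step size times the mini-batch average
    new_w = [wj - scale * g for wj, g in zip(w, g_w)]
    new_b = [bj - scale * g for bj, g in zip(b, g_b)]
    new_v = [vj - scale * g for vj, g in zip(v, g_v)]
    return new_w, new_b, new_v, c - scale * g_c


def train(xi=1.0, width=8, steps=4000, batch_size=16, gamma0=0.05, seed=0):
    """Run SGD and return (initial risk, final risk) on the evaluation grid."""
    rng = random.Random(seed)
    # Illustrative initialization scales, not taken from the source.
    params = (
        [0.5 * rng.gauss(0.0, 1.0) for _ in range(width)],  # inner weights
        [0.5 * rng.gauss(0.0, 1.0) for _ in range(width)],  # inner biases
        [0.1 * rng.gauss(0.0, 1.0) for _ in range(width)],  # outer weights
        0.0,                                                # outer bias
    )
    grid = [i / 200.0 for i in range(201)]
    initial = risk(params, xi, grid)
    for n in range(steps):
        gamma = gamma0 / math.sqrt(n + 1.0)  # small, but the sum diverges
        batch = [rng.random() for _ in range(batch_size)]
        params = sgd_step(params, batch, xi, gamma)
    return initial, risk(params, xi, grid)
```

Calling `initial, final = train()` and comparing the two values gives a rough empirical picture of the risk decreasing along one SGD trajectory. A single finite run of this kind only illustrates the setting; it does not verify the convergence of the expected risk to zero.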