Stochastic Gradient Descent (SGD) with adaptive steps is now widely used for training deep neural networks. Most theoretical results assume access to unbiased gradient estimators, an assumption that fails in several recent deep learning and reinforcement learning applications that rely on Monte Carlo methods. This paper provides a comprehensive non-asymptotic analysis of SGD with biased gradients and adaptive steps for convex and non-convex smooth functions. Our study incorporates time-dependent bias and emphasizes the importance of controlling the bias and Mean Squared Error (MSE) of the gradient estimator. In particular, we establish that Adagrad and RMSProp with biased gradients converge to critical points of smooth non-convex functions at a rate similar to existing results in the literature for the unbiased case. Finally, we provide experimental results using Variational Autoencoders (VAEs) that illustrate our convergence results and show how the effect of bias can be reduced by appropriate hyperparameter tuning.
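To make the setting concrete, the following minimal sketch runs a scalar Adagrad update driven by a biased, noisy gradient oracle. Everything specific here is an assumption chosen for illustration and is not taken from the paper: the test function f(x) = x^2 + 3 sin^2(x) (smooth, non-convex, with x = 0 as its only critical point), the bias schedule 1/sqrt(n+1), the Gaussian noise level, the step size, and the names `grad_f`, `biased_gradient`, and `adagrad_biased`.

```python
import math
import random


def grad_f(x):
    """Gradient of the illustrative test function f(x) = x^2 + 3 sin^2(x)."""
    return 2.0 * x + 3.0 * math.sin(2.0 * x)


def biased_gradient(x, n, rng, bias_scale=1.0, noise_std=0.1):
    """Hypothetical biased oracle: true gradient + decaying bias + Gaussian noise."""
    # Assumed time-dependent bias that shrinks with the iteration index n.
    bias = bias_scale / math.sqrt(n + 1)
    return grad_f(x) + bias + rng.gauss(0.0, noise_std)


def adagrad_biased(x0, steps=2000, lr=0.5, eps=1e-8, seed=0):
    """Scalar Adagrad driven by the biased oracle.

    Returns the final iterate and the history of true gradient norms |f'(x_n)|.
    """
    rng = random.Random(seed)
    x, accum = x0, 0.0
    grad_norms = []
    for n in range(steps):
        g = biased_gradient(x, n, rng)
        accum += g * g                           # running sum of squared estimates
        x -= lr * g / (math.sqrt(accum) + eps)   # adaptive step
        grad_norms.append(abs(grad_f(x)))        # track the true gradient, not the estimate
    return x, grad_norms


if __name__ == "__main__":
    x_final, grad_norms = adagrad_biased(x0=2.0)
    print("final x:", x_final)
    print("final |grad f|:", grad_norms[-1])
```

In this toy setup the true gradient norm shrinks as the bias decays. Setting `bias_scale` to a constant that does not decay with `n` would be one way to see the iterates settle away from the critical point, which is the kind of effect that controlling the bias is meant to prevent.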