Prior work indicates that the covariance of the gradient noise in minibatch Stochastic Gradient Descent (SGD) plays a critical role in its regularization and in its escape from low-potential points. Motivated by recent research in this area, we prove universality results: noise classes with the same mean and covariance structure as minibatch SGD have similar properties. We mainly consider the Multiplicative Stochastic Gradient Descent (M-SGD) algorithm introduced in previous work, which admits a much more general noise class than minibatch SGD. We establish non-asymptotic bounds for the M-SGD algorithm in the Wasserstein distance. We also show that, at any fixed point of the M-SGD algorithm, the M-SGD error approximately follows a scaled Gaussian distribution with mean $0$.
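To make the idea of matching mean and covariance concrete, the following is a minimal sketch. It is not the paper's own formulation. It assumes the update has the form $\theta \leftarrow \theta - \eta \, \frac{1}{n} \sum_i w_i \nabla \ell_i(\theta)$ with a random weight vector $w$, and that minibatches of size $b$ are drawn without replacement. The function names `minibatch_weights`, `gaussian_weights`, and `msgd_step` are hypothetical and chosen only for illustration.

```python
import numpy as np


def minibatch_weights(n, b, rng):
    """Weights that reproduce minibatch SGD (assumed: sampling without replacement).

    Each sampled index gets weight n/b, all others get 0, so E[w_i] = 1.
    """
    w = np.zeros(n)
    w[rng.choice(n, size=b, replace=False)] = n / b
    return w


def gaussian_weights(n, b, rng):
    """Gaussian weights with the same mean and covariance as minibatch_weights.

    Centering z gives Var = (n-1)/n and Cov = -1/n per coordinate pair;
    the scale c maps these to (n-b)/b and -(n-b)/(b(n-1)) respectively.
    """
    z = rng.standard_normal(n)
    c = np.sqrt(n * (n - b) / (b * (n - 1)))
    return 1.0 + c * (z - z.mean())


def msgd_step(theta, per_sample_grads, weights, lr):
    """One multiplicative-noise step: theta - lr * (1/n) * sum_i w_i * grad_i.

    per_sample_grads has shape (n, d); weights has shape (n,).
    """
    n = per_sample_grads.shape[0]
    return theta - lr * (weights @ per_sample_grads) / n
```

Under these assumptions, both weight vectors sum to $n$, so each step is an unbiased perturbation of full-batch gradient descent, and passing all-ones weights recovers the full-batch update exactly.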