In deep learning, Stochastic Gradient Descent (SGD) and its momentum-based variants are the predominant optimization algorithms. However, these momentum strategies, which accumulate historical gradients using a fixed hyperparameter $\beta$ to smooth the optimization process, often neglect the impact that the variance of historical gradients can have on the current gradient estimate. Fluctuations in gradient variance during training indicate that the objective function does not satisfy the Lipschitz continuity condition at all times, which makes optimization difficult. This paper explores the potential benefits of reducing the variance of historical gradients so that the optimizer converges to flat solutions. We further propose a new optimization method based on this variance reduction. We apply Wiener filter theory to enhance the first-moment estimate of SGD, introducing an adaptive weight into the optimizer. Specifically, the adaptive weight changes dynamically with the temporal fluctuation of gradient variance during model training. Experimental results demonstrate that the proposed adaptive-weight optimizer, SGDF (Stochastic Gradient Descent With Filter), achieves satisfactory performance compared with state-of-the-art optimizers.
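To make the idea of an adaptive weight concrete, the following is a minimal scalar sketch of how a Wiener-filter-style gain might blend a momentum estimate with the current gradient. It is an illustration under stated assumptions, not the paper's algorithm: the function name `filtered_sgd_sketch`, the residual-based variance estimate, the bias correction, and all default hyperparameters are hypothetical choices that do not come from the abstract.

```python
def filtered_sgd_sketch(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999,
                        eps=1e-8, steps=500):
    """Hypothetical scalar sketch of SGD with a Wiener-style adaptive weight.

    All names and defaults here are assumptions made for illustration only.
    """
    m = 0.0      # exponential moving average (EMA) of gradients
    s = 0.0      # EMA of squared residuals, used as a variance estimate
    gains = []   # adaptive weights recorded for inspection
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1.0 - beta1) * g
        resid = g - m
        s = beta2 * s + (1.0 - beta2) * resid ** 2
        # Bias-correct both running averages (assumed, as in Adam-style methods).
        m_hat = m / (1.0 - beta1 ** t)
        s_hat = s / (1.0 - beta2 ** t)
        # Gain near 1 trusts the current gradient; gain near 0 trusts history.
        gain = s_hat / (s_hat + resid ** 2 + eps)
        g_hat = m_hat + gain * (g - m_hat)
        theta = theta - lr * g_hat
        gains.append(gain)
    return theta, gains


if __name__ == "__main__":
    # Toy objective f(theta) = 0.5 * theta**2, whose gradient is theta.
    final_theta, gains = filtered_sgd_sketch(lambda th: th, theta=5.0)
    print(final_theta, min(gains), max(gains))
```

The contrast with fixed-$\beta$ momentum is that the blend between the historical estimate and the current gradient is recomputed at every step from a running variance estimate, rather than held constant.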