In deep learning, Stochastic Gradient Descent (SGD) and its momentum-based variants are the predominant optimization algorithms. However, these momentum strategies, which accumulate historical gradients using a fixed hyperparameter $\beta$ to smooth the optimization process, often neglect the impact that the variance of historical gradients can have on the current gradient estimate. Fluctuations in gradient variance during training indicate that the objective function does not satisfy the Lipschitz continuity condition at all times, which makes optimization difficult. This paper explores the potential benefits of reducing the variance of historical gradients so that the optimizer converges to flat solutions, and proposes a new optimization method based on this variance reduction. We apply Wiener filter theory to enhance the first-moment estimate of SGD, introducing an adaptive weight into the optimizer. Specifically, the adaptive weight changes dynamically with the temporal fluctuation of gradient variance during model training. Experimental results demonstrate that the proposed adaptive-weight optimizer, SGDF (Stochastic Gradient Descent With Filter), achieves satisfactory performance compared with state-of-the-art optimizers.
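To make the idea of an adaptive weight more concrete, the following is a minimal sketch in pure Python. It is not the SGDF update rule, which the abstract does not spell out. It assumes a generic Wiener/Kalman-style fusion: a fixed-$\beta$ running mean `m` of past gradients is combined with the current gradient `g` through a gain `k` that shrinks when the current gradient deviates strongly from the running mean relative to the running variance `s`. The function names, the gain formula, the absence of bias correction, and the toy quadratic objective are all illustrative assumptions.

```python
import random


def init_state(n):
    """Zero-initialised running mean, running variance, and last gain per parameter."""
    return {"m": [0.0] * n, "s": [0.0] * n, "k": [0.0] * n}


def filtered_sgd_step(theta, grad, state, lr=0.1, beta=0.9, eps=1e-8):
    """One step of a hypothetical filter-style SGD update (illustrative only)."""
    new_theta = []
    for i, (p, g) in enumerate(zip(theta, grad)):
        # Fixed-beta exponential moving average of the gradient (momentum term).
        m = beta * state["m"][i] + (1.0 - beta) * g
        # Deviation of the current gradient from the running mean.
        resid = g - m
        # Running estimate of the gradient variance around the mean.
        s = beta * state["s"][i] + (1.0 - beta) * resid * resid
        # Assumed Wiener/Kalman-style gain: close to 1 when the current
        # deviation is small relative to the running variance, close to 0
        # when the current gradient looks like an outlier.
        k = s / (s + resid * resid + eps)
        # Fused gradient estimate: a convex combination of m and g.
        g_hat = m + k * resid
        state["m"][i], state["s"][i], state["k"][i] = m, s, k
        new_theta.append(p - lr * g_hat)
    return new_theta


def run_demo(steps=2000, seed=0, noise_std=0.01):
    """Minimise f(x) = 0.5 * x^2 per coordinate using noisy gradients."""
    rng = random.Random(seed)
    theta = [5.0, -3.0]
    state = init_state(len(theta))
    for _ in range(steps):
        # True gradient is x; Gaussian noise stands in for minibatch noise.
        grad = [p + rng.gauss(0.0, noise_std) for p in theta]
        theta = filtered_sgd_step(theta, grad, state)
    return theta, state


if __name__ == "__main__":
    theta, state = run_demo()
    print("final parameters:", theta)
    print("last gains:", state["k"])
```

In this sketch, a gain of 0 reduces the step to plain fixed-$\beta$ momentum and a gain of 1 reduces it to plain SGD on the raw gradient; the time-varying gain interpolates between the two, which is one way to read the abstract's "adaptive weight".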