Momentum-based optimizers are widely adopted for training neural networks. However, the optimal selection of momentum coefficients remains elusive. This uncertainty impedes a clear understanding of the role of momentum in stochastic gradient methods. In this paper, we present a frequency domain analysis framework that interprets the momentum method as a time-variant filter for gradients, where adjustments to momentum coefficients modify the filter characteristics. Our experiments support this perspective and provide a deeper understanding of the mechanism involved. Moreover, our analysis reveals the following significant findings: high-frequency gradient components are undesired in the late stages of training; preserving the original gradient in the early stages, and gradually amplifying low-frequency gradient components during training both enhance generalization performance. Based on these insights, we propose Frequency Stochastic Gradient Descent with Momentum (FSGDM), a heuristic optimizer that dynamically adjusts the momentum filtering characteristic with an empirically effective dynamic magnitude response. Experimental results demonstrate the superiority of FSGDM over conventional momentum optimizers.
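The filtering view above can be made concrete with a small sketch. Heavy-ball momentum, `m_t = beta * m_{t-1} + g_t`, is a first-order IIR filter on the gradient sequence with transfer function `H(z) = 1 / (1 - beta * z^{-1})`; evaluating its magnitude response shows how raising the coefficient amplifies low-frequency gradient components while attenuating high-frequency ones. This is a generic illustration of the momentum-as-filter interpretation, not the paper's FSGDM schedule, and the function name is ours:

```python
import math

def momentum_magnitude_response(beta, omega):
    # Heavy-ball momentum m_t = beta * m_{t-1} + g_t is an IIR filter
    # with transfer function H(z) = 1 / (1 - beta * z^{-1}).
    # Return |H(e^{j*omega})| at angular frequency omega (radians/step).
    re = 1.0 - beta * math.cos(omega)
    im = beta * math.sin(omega)
    return 1.0 / math.hypot(re, im)

# DC (low-frequency) gain is 1/(1-beta) and grows with beta,
# while the Nyquist (high-frequency) gain 1/(1+beta) shrinks:
for beta in (0.5, 0.9, 0.99):
    dc = momentum_magnitude_response(beta, 0.0)      # omega = 0
    hi = momentum_magnitude_response(beta, math.pi)  # omega = pi
    print(f"beta={beta}: |H(0)|={dc:.1f}, |H(pi)|={hi:.3f}")
```

A time-variant filter, in this framing, is simply one whose `beta` changes over training steps, which is how adjusting the momentum coefficient reshapes which gradient frequencies survive.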