Injecting artificial noise into gradient descent (GD) is commonly used to improve the performance of machine learning models. Such perturbed gradient descent (PGD) methods usually use uncorrelated noise. It is, however, not known whether this choice is optimal or whether other types of noise could provide better generalization performance. In this paper, we zoom in on the problem of correlating the perturbations of consecutive PGD steps. We consider a variety of objective functions for which we find that GD with anticorrelated perturbations ("Anti-PGD") generalizes significantly better than GD and standard (uncorrelated) PGD. To support these experimental findings, we also provide a theoretical analysis showing that Anti-PGD moves to wider minima, while GD and PGD remain stuck in suboptimal regions or even diverge. This new connection between anticorrelated noise and generalization opens the field to novel ways to exploit noise for training machine learning models.
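To make the distinction between the three methods concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state the update rules, so the sketch assumes one common way to obtain anticorrelated perturbations: each step is perturbed by the difference of two consecutive i.i.d. Gaussian draws, xi_{n+1} - xi_n, whereas PGD adds xi_{n+1} directly. The function name `perturbed_gd` and the parameters `lr`, `sigma`, and `mode` are hypothetical and chosen for illustration only.

```python
import numpy as np


def perturbed_gd(grad, x0, lr, sigma, steps, mode="anti", seed=0):
    """Run GD, PGD, or Anti-PGD under the assumed update rules.

    Assumed updates (not stated in the abstract), with xi_n i.i.d. N(0, sigma^2 I):
      "gd":   x <- x - lr * grad(x)
      "pgd":  x <- x - lr * grad(x) + xi_{n+1}
      "anti": x <- x - lr * grad(x) + (xi_{n+1} - xi_n)
    """
    if mode not in ("gd", "pgd", "anti"):
        raise ValueError(f"unknown mode: {mode!r}")

    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    xi_prev = np.zeros_like(x)  # convention for this sketch: xi_0 = 0

    for _ in range(steps):
        # Draw a fresh i.i.d. Gaussian vector in every mode so that the
        # random stream is identical across modes for a given seed.
        xi = sigma * rng.standard_normal(x.shape)

        if mode == "gd":
            noise = np.zeros_like(x)
        elif mode == "pgd":
            noise = xi
        else:
            # Consecutive increments share xi with opposite signs,
            # which makes them negatively correlated.
            noise = xi - xi_prev

        x = x - lr * grad(x) + noise
        xi_prev = xi

    return x
```

Under this assumption the anticorrelated increments telescope: on a flat objective the Anti-PGD iterate after n steps is displaced from its start only by xi_n, while the PGD iterate accumulates all n draws like a random walk. A call such as `perturbed_gd(lambda x: x, np.ones(2), lr=0.1, sigma=0.01, steps=100, mode="anti")` runs the sketch on a simple quadratic.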