Injecting artificial noise into gradient descent (GD) is commonly employed to improve the performance of machine learning models. Such perturbed gradient descent (PGD) methods usually use uncorrelated noise. It is not known, however, whether this choice is optimal or whether other types of noise could provide better generalization performance. In this paper, we zoom in on the problem of correlating the perturbations of consecutive PGD steps. We consider a variety of objective functions and find that GD with anticorrelated perturbations ("Anti-PGD") generalizes significantly better than both GD and standard (uncorrelated) PGD. To support these experimental findings, we also provide a theoretical analysis showing that Anti-PGD moves to wider minima, while GD and PGD remain stuck in suboptimal regions or even diverge. This new connection between anticorrelated noise and generalization opens the field to novel ways of exploiting noise for training machine learning models.
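To make the distinction between uncorrelated and anticorrelated perturbations concrete, here is a minimal NumPy sketch. The abstract does not say how the anticorrelated noise is constructed, so this sketch assumes one simple construction: each perturbation is the difference of two consecutive i.i.d. Gaussian draws, which gives consecutive perturbations a lag-1 correlation of -0.5. The function names (`perturbations`, `perturbed_gd`, `lag1_correlation`) and all parameters are hypothetical and chosen for illustration only; this is not presented as the authors' implementation.

```python
import numpy as np


def perturbations(n_steps, dim, sigma, rng, anticorrelated):
    """Return an (n_steps, dim) array of per-step perturbations.

    Assumption of this sketch: anticorrelated noise is built as the
    increment xi[n+1] - xi[n] of i.i.d. Gaussian draws xi.
    """
    xi = rng.normal(0.0, sigma, size=(n_steps + 1, dim))
    if anticorrelated:
        # Consecutive increments share one draw with opposite sign,
        # so they are negatively correlated.
        return xi[1:] - xi[:-1]
    # Standard PGD: independent draws at every step.
    return xi[1:]


def perturbed_gd(grad, x0, lr, n_steps, sigma, anticorrelated, seed=0):
    """Run gradient descent with an additive perturbation after each step."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    noise = perturbations(n_steps, x.size, sigma, rng, anticorrelated)
    for eps in noise:
        x = x - lr * grad(x) + eps
    return x


def lag1_correlation(noise):
    """Empirical correlation between perturbations at consecutive steps."""
    a, b = noise[:-1].ravel(), noise[1:].ravel()
    return float(np.corrcoef(a, b)[0, 1])
```

Calling `perturbed_gd(grad, x0, lr, n_steps, sigma, anticorrelated=True)` with a user-supplied gradient function runs the anticorrelated variant, and `anticorrelated=False` gives the uncorrelated baseline. Under the construction assumed here, the anticorrelated perturbations sum over time to the difference of just two draws, so their accumulated effect stays bounded rather than growing like a random walk.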