In recent years, signSGD has garnered interest both as a practical optimizer and as a simple model for understanding adaptive optimizers like Adam. Although there is a general consensus that signSGD preconditions optimization and reshapes noise, quantitatively understanding these effects in theoretically solvable settings remains difficult. We present an analysis of signSGD in a high-dimensional limit and derive a limiting SDE and ODE that describe the risk. Using this framework, we quantify four effects of signSGD: effective learning rate, noise compression, diagonal preconditioning, and gradient noise reshaping. Our analysis is consistent with experimental observations and goes beyond them by quantifying how these effects depend on the data and noise distributions. We conclude with a conjecture on how these results might be extended to Adam.
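For concreteness, the signSGD update discussed above replaces each stochastic gradient coordinate by its sign before taking a step. The following is a minimal sketch on a toy problem, not the setting analyzed in the paper: the linear least-squares model, the isotropic Gaussian data, the constant step size, and all names (`sign_sgd`, `risk`, `theta_star`) are assumptions chosen purely for illustration.

```python
import random


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def risk(theta, theta_star, noise_std):
    """Population risk 0.5 * E[(x . theta - y)^2], assuming x ~ N(0, I)."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(theta, theta_star))
    return 0.5 * (dist_sq + noise_std ** 2)


def sign_sgd(theta_star, steps, lr, noise_std, rng):
    """Run one-sample streaming signSGD from theta = 0.

    Returns the final iterate and the risk after every step.
    """
    d = len(theta_star)
    theta = [0.0] * d
    risks = [risk(theta, theta_star, noise_std)]
    for _ in range(steps):
        # Draw a fresh sample (x, y) with y = x . theta_star + label noise.
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        y = sum(xi * ti for xi, ti in zip(x, theta_star)) + rng.gauss(0.0, noise_std)
        residual = sum(xi * ti for xi, ti in zip(x, theta)) - y
        # The gradient of 0.5 * residual^2 is residual * x;
        # signSGD keeps only the sign of each coordinate.
        theta = [ti - lr * sign(residual * xi) for ti, xi in zip(theta, x)]
        risks.append(risk(theta, theta_star, noise_std))
    return theta, risks


rng = random.Random(0)
d = 20
theta_star = [rng.gauss(0.0, 1.0) / d ** 0.5 for _ in range(d)]
theta, risks = sign_sgd(theta_star, steps=3000, lr=0.002, noise_std=0.1, rng=rng)
print(f"initial risk {risks[0]:.4f}, final risk {risks[-1]:.4f}")
```

Because every coordinate moves by exactly `lr` (or not at all) per step, the step size does not shrink as the gradient does. With a constant `lr`, the risk in this sketch is expected to settle at a noise floor rather than converge to its minimum.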