Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. An analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scale vision and language settings. A key practical result is the derivation of a $\textit{square root scaling rule}$ to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.
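As a rough illustration of how such a rule might be applied in practice, the sketch below rescales Adam-style hyperparameters when the batch size is multiplied by a factor $\kappa$. The abstract only states that a square root scaling rule exists; the specific functional forms for the momentum decays and $\epsilon$ used here are assumptions for illustration, not the paper's stated rule.

```python
import math

def sqrt_scale_hparams(lr, beta1, beta2, eps, kappa):
    """Rescale adaptive-optimizer hyperparameters when batch size
    is multiplied by kappa (hypothetical helper, for illustration).

    Assumed forms (not taken from the abstract):
      - learning rate scales by sqrt(kappa),
      - the decay gaps (1 - beta) scale by kappa,
      - eps scales by 1 / sqrt(kappa).
    """
    root = math.sqrt(kappa)
    return {
        "lr": lr * root,
        "beta1": 1.0 - kappa * (1.0 - beta1),
        "beta2": 1.0 - kappa * (1.0 - beta2),
        "eps": eps / root,
    }

# Example: quadrupling the batch size doubles the learning rate.
scaled = sqrt_scale_hparams(lr=1e-3, beta1=0.9, beta2=0.999,
                            eps=1e-8, kappa=4.0)
```

Contrast this with the linear scaling rule commonly used for SGD, where the learning rate would be multiplied by $\kappa$ itself rather than $\sqrt{\kappa}$.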