Optimization is often cast as a deterministic problem whose solution is found through an iterative procedure such as gradient descent. When training neural networks, however, the loss function changes over (iteration) time due to the randomized selection of a subset of the samples. This randomization turns the optimization problem into a stochastic one. We propose to consider the loss as a noisy observation with respect to some reference optimum. This interpretation of the loss allows us to adopt Kalman filtering as an optimizer, as its recursive formulation is designed to estimate unknown parameters from noisy measurements. Moreover, we show that the Kalman Filter dynamical model for the evolution of the unknown parameters can be used to capture the gradient dynamics of advanced methods such as Momentum and Adam. We call this stochastic optimization method KOALA, which is short for Kalman Optimization Algorithm with Loss Adaptivity. KOALA is an easy-to-implement, scalable, and efficient method to train neural networks. We provide a convergence analysis and show experimentally that it yields parameter estimates that are on par with or better than existing state-of-the-art optimization algorithms across several neural network architectures and machine learning tasks, such as computer vision and language modeling.
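To make the core idea concrete, the following is a minimal sketch of the "loss as a noisy measurement" view: an extended Kalman update in which the current loss is treated as a noisy observation whose target value is a reference optimum. The random-walk dynamics, the scalar covariance approximation, the zero reference optimum, and the constants q and r (as well as the helper name kalman_step) are illustrative assumptions for this sketch, not the paper's exact KOALA recursion.

import numpy as np

def kalman_step(w, p, loss_fn, grad_fn, q=0.1, r=1.0, loss_star=0.0):
    """One extended-Kalman update that treats loss_fn(w) as a noisy
    measurement whose target value is the reference optimum loss_star."""
    p_pred = p + q                    # predict: random-walk dynamics w_{k+1} = w_k + noise
    g = grad_fn(w)                    # measurement Jacobian H = dL/dw (a row vector)
    s = p_pred * (g @ g) + r          # innovation covariance S = H P H^T + R
    k_gain = p_pred * g / s           # Kalman gain K = P H^T S^{-1}
    w_new = w + k_gain * (loss_star - loss_fn(w))   # correct using the innovation
    # Scalar covariance approximation: average trace of (I - K H) P_pred
    p_new = p_pred * (1.0 - p_pred * (g @ g) / (s * w.size))
    return w_new, p_new

# Toy usage: minimize the quadratic loss L(w) = 0.5 ||w||^2.
loss_fn = lambda w: 0.5 * (w @ w)
grad_fn = lambda w: w
w, p = np.ones(5), 1.0
for _ in range(100):
    w, p = kalman_step(w, p, loss_fn, grad_fn)
print(np.linalg.norm(w))  # converges toward the optimum at 0

Note the design point the abstract alludes to: because the filter estimates an innovation covariance, the effective step size adapts to how noisy the loss observations are, rather than being fixed by hand as in plain gradient descent.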