Stochastic gradient-based optimization is crucial for training neural networks. While popular approaches heuristically adapt the step size and direction by rescaling gradients, a more principled way to improve optimizers is to use second-order information: such methods precondition the gradient with the objective's Hessian. Yet computing the Hessian is usually expensive, and using second-order information effectively in the stochastic gradient setting is non-trivial. We propose Information-Theoretic Trust Region Optimization (arTuRO) for improved updates with uncertain second-order information. By modeling the network parameters as a Gaussian distribution and using a Kullback-Leibler divergence-based trust region, our approach takes bounded steps that account for the objective's curvature and the uncertainty in the parameters. Before each update, it solves the trust region problem for an optimal step size, resulting in a more stable and faster optimization process. We approximate the diagonal elements of the Hessian from stochastic gradients using a simple recursive least squares approach, building a model of the expected Hessian over time from first-order information only. We show that arTuRO combines the fast convergence of adaptive moment-based optimization with the generalization capabilities of SGD.
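To make the idea of a Kullback-Leibler trust region concrete, the following minimal sketch solves a deliberately simplified version of the problem. It is not the arTuRO update itself. It assumes a diagonal Gaussian whose covariance `var` is held fixed, a linear model of the objective, and a KL bound `epsilon`; the function names `kl_trust_region_step` and `kl_same_cov` are hypothetical. Under these assumptions the KL divergence reduces to `0.5 * sum(delta_i**2 / var_i)`, and the bounded step has a closed form.

```python
import math


def kl_same_cov(delta, var):
    """KL divergence between two Gaussians with the same diagonal covariance
    `var` whose means differ by `delta`."""
    return 0.5 * sum(d * d / v for d, v in zip(delta, var))


def kl_trust_region_step(grad, var, epsilon):
    """Mean update that minimizes the linear model grad . delta subject to
    kl_same_cov(delta, var) <= epsilon.

    Assumptions (illustrative, not taken from the source): fixed diagonal
    covariance, linear objective model, bound active at the solution.
    The Lagrangian gives delta = -eta * var * grad, where eta is chosen so
    that the KL divergence equals epsilon exactly.
    """
    quad = sum(v * g * g for g, v in zip(grad, var))  # g^T Sigma g
    if quad == 0.0:
        # Zero gradient: any step leaves the linear model unchanged.
        return [0.0] * len(grad)
    eta = math.sqrt(2.0 * epsilon / quad)
    return [-eta * v * g for g, v in zip(grad, var)]
```

The step length here is set by the bound rather than by a hand-tuned learning rate, and coordinates with larger variance move further. One could, for instance, tie `var` to an inverse curvature estimate so that the step is preconditioned; whether arTuRO does exactly this is not stated in the abstract.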
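The abstract mentions estimating the diagonal of the Hessian from stochastic gradients with recursive least squares. The sketch below shows one way such an estimator could look; it is an assumption-laden illustration, not the authors' implementation. It assumes the regression target is the per-coordinate secant relation `d_grad[i] ≈ h[i] * d_theta[i]` between successive parameter and gradient differences, and it uses a forgetting factor so that older samples count less. The class name `DiagonalRLS` and its parameters are hypothetical.

```python
class DiagonalRLS:
    """Per-coordinate scalar recursive least squares with forgetting.

    Illustrative assumption: each diagonal curvature entry h[i] is fitted
    to the relation d_grad[i] ~= h[i] * d_theta[i].
    """

    def __init__(self, dim, forgetting=0.99, init_cov=1e6):
        self.h = [0.0] * dim        # current curvature estimates
        self.p = [init_cov] * dim   # per-coordinate RLS covariances
        self.lam = forgetting       # 1.0 means no forgetting

    def update(self, d_theta, d_grad):
        """Fold in one (parameter difference, gradient difference) pair."""
        for i, (x, y) in enumerate(zip(d_theta, d_grad)):
            p = self.p[i]
            denom = self.lam + x * x * p
            gain = p * x / denom
            # Correct the estimate by the gain-weighted prediction error.
            self.h[i] += gain * (y - self.h[i] * x)
            # Algebraically equal to (p - gain * x * p) / lam, but avoids
            # subtracting two nearly equal numbers.
            self.p[i] = p / denom
        return self.h
```

The estimator needs only gradient differences, so it uses first-order information throughout. With noisy gradients the forgetting factor trades responsiveness against variance, and in practice the estimates would likely need clipping or damping before being used as a preconditioner.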