Stochastic gradient descent and other first-order variants, such as Adam and AdaGrad, are commonly used in deep learning because of their computational efficiency and low memory requirements. However, these methods do not exploit curvature information, so iterates can converge to saddle points or poor local minima. Quasi-Newton methods, by contrast, compute Hessian approximations that exploit this information at a comparable computational cost, re-using previously computed iterates and gradients to form a low-rank structured update. The most widely used quasi-Newton update, L-BFGS, guarantees a positive-definite Hessian approximation, making it suitable in a line search setting. However, the loss functions of DNNs are non-convex, and the Hessian is potentially indefinite. In this paper, we propose a limited-memory symmetric rank-one (L-SR1) quasi-Newton approach that allows indefinite Hessian approximations, enabling directions of negative curvature to be exploited. Furthermore, we use a modified adaptive regularization with cubics (ARC) approach, which generates a sequence of cubic subproblems that have closed-form solutions for suitable regularization choices. We evaluate the proposed method on autoencoder and feed-forward neural network models and compare it against state-of-the-art first-order adaptive stochastic methods as well as other quasi-Newton methods.
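To illustrate the distinction the abstract draws between BFGS-style and SR1 updates, the following is a minimal sketch of the classical (full-memory) SR1 Hessian update, not the paper's limited-memory implementation. The tolerance `r` and the skip rule are the standard SR1 safeguard against a vanishing denominator; all names here are illustrative.

```python
import numpy as np

def sr1_update(B, s, y, r=1e-8):
    """One symmetric rank-one (SR1) update of a Hessian approximation B.

    s : iterate difference  x_{k+1} - x_k
    y : gradient difference grad f(x_{k+1}) - grad f(x_k)

    Unlike BFGS, the SR1 update does not force B to stay positive
    definite, so it can represent directions of negative curvature.
    The update is skipped when the denominator is too small relative
    to ||s||·||y - Bs|| (the standard SR1 safeguard).
    """
    v = y - B @ s
    denom = v @ s
    if abs(denom) < r * np.linalg.norm(s) * np.linalg.norm(v):
        return B  # skip update to avoid numerical blow-up
    return B + np.outer(v, v) / denom
```

On a quadratic with an indefinite Hessian, e.g. `A = diag(2, -1)`, two SR1 updates along the coordinate directions recover `A` exactly even though it is not positive definite, which is precisely the behavior a line-search L-BFGS cannot reproduce.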