Stochastic gradient descent and other first-order variants, such as Adam and AdaGrad, are commonly used in deep learning because of their computational efficiency and low memory requirements. However, these methods do not exploit curvature information, so iterates can converge to saddle points or poor local minima. Quasi-Newton methods, by contrast, compute Hessian approximations that exploit this information at a comparable computational cost, re-using previously computed iterates and gradients to form a low-rank structured update. The most widely used quasi-Newton update, L-BFGS, guarantees a positive-definite Hessian approximation, making it suitable in a line search setting. However, the loss functions of DNNs are non-convex, and the Hessian is potentially indefinite. In this paper, we propose a limited-memory symmetric rank-one (L-SR1) quasi-Newton approach that allows indefinite Hessian approximations, enabling directions of negative curvature to be exploited. Furthermore, we use a modified adaptive regularization with cubics (ARC) approach, which generates a sequence of cubic subproblems that have closed-form solutions for suitable regularization choices. We evaluate the proposed method on autoencoder and feed-forward neural network models and compare it against state-of-the-art first-order adaptive stochastic methods as well as other quasi-Newton methods.
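To illustrate the distinction the abstract draws between BFGS-style and SR1 updates, the following is a minimal sketch of the classical (full-memory) SR1 Hessian update, not the paper's limited-memory implementation. The tolerance `r` and the skip rule are the standard SR1 safeguard against a vanishing denominator; all names here are illustrative.

```python
import numpy as np

def sr1_update(B, s, y, r=1e-8):
    """One symmetric rank-one (SR1) update of a Hessian approximation B.

    s : iterate difference  x_{k+1} - x_k
    y : gradient difference grad f(x_{k+1}) - grad f(x_k)

    Unlike BFGS, the SR1 update does not force B to stay positive
    definite, so it can represent directions of negative curvature.
    The update is skipped when the denominator is too small relative
    to ||s||·||y - Bs|| (the standard SR1 safeguard).
    """
    v = y - B @ s
    denom = v @ s
    if abs(denom) < r * np.linalg.norm(s) * np.linalg.norm(v):
        return B  # skip update to avoid numerical blow-up
    return B + np.outer(v, v) / denom
```

On a quadratic with an indefinite Hessian, e.g. `A = diag(2, -1)`, two SR1 updates along the coordinate directions recover `A` exactly even though it is not positive definite, which is precisely the behavior a line-search L-BFGS cannot reproduce.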