Despite their better convergence properties compared to first-order optimizers, second-order optimizers for deep learning have been less popular due to their significant computational costs. The primary efficiency bottleneck in such optimizers is the matrix inverse calculations in the preconditioning step, which are expensive to compute on GPUs. In this paper, we introduce Jorge, a second-order optimizer that promises the best of both worlds: the rapid convergence of second-order methods and the high computational efficiency typical of first-order methods. We address the primary computational bottleneck by eliminating matrix inverses entirely, using an approximation of the preconditioner computation. This makes Jorge extremely efficient on GPUs in terms of wall-clock time. Further, we describe an approach to determine Jorge's hyperparameters directly from a well-tuned SGD baseline, thereby significantly reducing tuning effort. Our empirical evaluations demonstrate the distinct advantages of Jorge, which outperforms state-of-the-art optimizers such as SGD, AdamW, and Shampoo across multiple deep learning models in terms of both sample efficiency and wall-clock time.
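The abstract does not spell out which approximation replaces the matrix inverse, so the following is only a generic illustration of the idea, not Jorge's actual update rule. It is a minimal sketch, assuming a matrix `a` close enough to the identity that the Neumann series converges, and it approximates the inverse using nothing but matrix multiplications and additions. The helper names (`matmul`, `identity`, `neumann_inverse`) and the parameter `num_terms` are hypothetical.

```python
def matmul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    """Return the n x n identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def neumann_inverse(a, num_terms=8):
    """Approximate inv(a) by the truncated Neumann series sum_k (I - a)^k.

    Assumption: the spectral radius of (I - a) is below 1, otherwise the
    series does not converge. Only multiplications and additions are used,
    so no explicit inverse or factorization is ever computed.
    """
    n = len(a)
    eye = identity(n)
    # x = I - a, so that inv(a) = I + x + x^2 + ...
    x = [[eye[i][j] - a[i][j] for j in range(n)] for i in range(n)]
    term = eye              # current power of x, starting at x^0
    approx = identity(n)    # running sum of the series
    for _ in range(num_terms - 1):
        term = matmul(term, x)
        approx = [[approx[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return approx


# A small, well-conditioned example matrix close to the identity.
a = [[1.0, 0.1], [0.1, 1.0]]
approx_inv = neumann_inverse(a)

# Check quality: a @ approx_inv should be close to the identity.
product = matmul(a, approx_inv)
residual = max(abs(product[i][j] - (1.0 if i == j else 0.0))
               for i in range(2) for j in range(2))
```

The plain-Python lists are used only to keep the sketch self-contained. On a GPU the same pattern would be written with an array library's batched matrix multiplications, which is what makes multiplication-only approximations attractive there.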