Second-order optimization algorithms exhibit excellent convergence properties for training deep learning models, but often incur significant computation and memory overheads. This can make them less efficient to train with than first-order counterparts such as stochastic gradient descent (SGD). In this work, we present a memory- and time-efficient second-order algorithm named Eva with two novel techniques: 1) we construct the second-order information with the Kronecker factorization of small stochastic vectors over a mini-batch of training data to reduce memory consumption, and 2) we use the Sherman-Morrison formula to derive an efficient update formula that avoids explicitly computing matrix inverses. We further extend Eva to a general vectorized approximation framework to improve the compute and memory efficiency of two existing second-order algorithms (FOOF and Shampoo) without affecting their convergence performance. Extensive experimental results on different models and datasets show that Eva reduces the end-to-end training time by up to 2.05x compared to first-order SGD and by up to 2.42x compared to second-order algorithms (K-FAC and Shampoo).
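To illustrate why the Sherman-Morrison formula removes the need for an explicit inverse, the following is a minimal sketch, not Eva's actual update rule. It assumes a generic damped rank-one matrix v v^T + damping * I and solves against a gradient-like vector g. The function names, the damping parameter, and the example values are hypothetical and chosen only for illustration.

```python
def dot(x, y):
    """Inner product of two equal-length sequences."""
    return sum(xi * yi for xi, yi in zip(x, y))


def rank_one_damped_solve(v, g, damping):
    """Return (v v^T + damping * I)^{-1} g without forming any matrix.

    By the Sherman-Morrison formula,
        (v v^T + damping * I)^{-1}
            = (I - v v^T / (damping + v^T v)) / damping,
    so the solve needs only two inner products and one vector update.
    """
    if damping <= 0:
        raise ValueError("damping must be positive")
    scale = dot(v, g) / (damping + dot(v, v))
    return [(gi - scale * vi) / damping for gi, vi in zip(g, v)]


# Hypothetical example values (not taken from the paper).
v = [1.0, 2.0, 3.0]
g = [0.5, -1.0, 2.0]
damping = 0.1
x = rank_one_damped_solve(v, g, damping)

# Sanity check: multiplying x by (v v^T + damping * I) should recover g.
vx = dot(v, x)
residual = [vi * vx + damping * xi - gi for vi, xi, gi in zip(v, x, g)]
```

For a vector of length d, this sketch uses O(d) time and memory, whereas forming the d x d matrix and inverting it would take O(d^2) memory and O(d^3) time.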