Modern GPUs are equipped with large amounts of high-bandwidth memory, enabling them to support mini-batch sizes of up to tens of thousands of training samples. However, most existing optimizers struggle to perform effectively at such large batch sizes. As batch size increases, gradient noise decreases due to averaging over many samples, limiting the ability of first-order methods to escape sharp or suboptimal minima and reach the global minimum. Meanwhile, second-order methods such as natural gradient descent with Kronecker-Factored Approximate Curvature (KFAC) often require excessively high damping to remain stable at large batch sizes. This high damping effectively washes out the curvature information that gives these methods their advantage, reducing their performance to that of plain gradient descent. In this paper, we introduce Fisher-Orthogonal Projection (FOP), a novel technique that restores the effectiveness of second-order methods at very large batch sizes, enabling scalable training with improved generalization and faster convergence. FOP constructs a variance-aware update direction from the gradients of two sub-batches, augmenting the average gradient with the component of the gradient difference that is orthogonal to the average under the Fisher metric.
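The FOP construction described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a diagonal Fisher approximation (the paper uses KFAC), and the names `fop_update`, `fisher_diag`, and the scaling factor `lam` are illustrative stand-ins.

```python
import numpy as np

def fop_update(g1, g2, fisher_diag, lam=1.0):
    """Illustrative Fisher-Orthogonal Projection update from two sub-batch gradients.

    fisher_diag: diagonal Fisher approximation (a simplifying stand-in for
    the Kronecker-factored curvature used in practice).
    """
    g_avg = 0.5 * (g1 + g2)   # average gradient over the two sub-batches
    g_diff = g1 - g2          # gradient difference (captures sub-batch variance)

    # Fisher inner product <u, v>_F = u^T F v, with F diagonal here
    def inner(u, v):
        return np.dot(u * fisher_diag, v)

    # Remove from g_diff its Fisher-projection onto g_avg, leaving the
    # component that is Fisher-orthogonal to the average gradient
    coef = inner(g_diff, g_avg) / (inner(g_avg, g_avg) + 1e-12)
    d_perp = g_diff - coef * g_avg

    # Variance-aware direction: average gradient plus the orthogonal component
    return g_avg + lam * d_perp
```

By construction, `d_perp` satisfies `<d_perp, g_avg>_F = 0`, so the added component perturbs the update only along directions the average gradient does not already cover.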