Training billion-parameter Transformers is often brittle, with transient loss spikes and divergence that waste compute. Although the recently developed Edge of Stability (EoS) theory provides a powerful tool for understanding and controlling the stability of optimization methods via the (preconditioned) curvature, curvature-controlling methods are rarely used in large-scale Transformer training because curvature estimation is complex. To address this, we first introduce a fast online estimator of the largest (preconditioned) Hessian eigenvalue (i.e., the curvature), based on a warm-started variant of power iteration with Hessian-vector products. We show theoretically, and verify empirically, that the proposed method makes per-iteration curvature tracking feasible at billion-parameter scale while being more accurate. Using this tool, we find that training instabilities coincide with surges in preconditioned curvature and that curvature grows with depth. Motivated by these observations, we propose architecture warm-up: progressively growing network depth to control the preconditioned Hessian and stabilize training. Experiments on large Transformers validate that our approach enables efficient curvature tracking and reduces instabilities compared to existing state-of-the-art stabilization techniques, without slowing convergence.
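To make the idea of warm-started power iteration with Hessian-vector products concrete, the following is a minimal sketch, not the estimator proposed in the paper. It rests on several assumptions for illustration: a toy quadratic loss with a known diagonal Hessian, Hessian-vector products approximated by central differences of the gradient (a real setup would typically use automatic differentiation), no preconditioning, and hypothetical names (`WarmStartedCurvature`, `hvp`, `demo`). The only point it shows is that carrying the eigenvector estimate across training steps lets a single power-iteration step per training step track the largest eigenvalue.

```python
import math
import random


def grad(theta, diag):
    """Gradient of the toy loss 0.5 * sum(d_i * theta_i**2) (illustrative assumption)."""
    return [d * t for d, t in zip(diag, theta)]


def hvp(grad_fn, theta, v, eps=1e-3):
    """Approximate the Hessian-vector product H v by central differences of the gradient."""
    plus = grad_fn([t + eps * x for t, x in zip(theta, v)])
    minus = grad_fn([t - eps * x for t, x in zip(theta, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]


def normalize(v):
    """Scale v to unit Euclidean norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return v if norm == 0.0 else [x / norm for x in v]


class WarmStartedCurvature:
    """Tracks the dominant Hessian eigenvalue, reusing the previous eigenvector estimate."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.v = normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])

    def update(self, grad_fn, theta, num_iters=1):
        eig = 0.0
        for _ in range(num_iters):
            hv = hvp(grad_fn, theta, self.v)
            # Rayleigh quotient v^T H v; v has unit norm.
            eig = sum(a * b for a, b in zip(self.v, hv))
            # Keep the new direction as the warm start for the next call.
            self.v = normalize(hv)
        return eig


def demo(steps=50, lr=0.05):
    """Run gradient descent on the toy loss, updating the estimate once per step."""
    diag = [1.0, 2.0, 5.0, 10.0]  # Hessian eigenvalues of the toy loss
    theta = [1.0, 1.0, 1.0, 1.0]

    def grad_fn(th):
        return grad(th, diag)

    tracker = WarmStartedCurvature(dim=len(theta))
    estimate = 0.0
    for _ in range(steps):
        estimate = tracker.update(grad_fn, theta)  # one power-iteration step
        theta = [t - lr * g for t, g in zip(theta, grad_fn(theta))]
    return estimate


if __name__ == "__main__":
    print(demo())
```

In this toy setting the estimate approaches 10.0, the largest eigenvalue of the Hessian, even though only one power-iteration step is taken per training step. Note that power iteration converges to the eigenvalue of largest magnitude, which coincides with the largest eigenvalue here because the toy Hessian is positive definite. For plain gradient descent on a quadratic, the classical stability threshold is a largest eigenvalue of 2 / lr, so a tracked value can be compared against that bound; the preconditioned case discussed in the abstract is not covered by this sketch.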