Recent efforts to accelerate LLM pretraining have focused on computationally efficient approximations that exploit second-order structure. This raises a key question for large-scale training: how much performance is forfeited by these approximations? To probe this question, we establish a practical upper bound on iteration complexity by applying full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters. Our experiments show that full GN updates yield substantial gains over existing optimizers, achieving a 5.4x reduction in training iterations compared to strong baselines such as SOAP and Muon. Furthermore, we find that a precise layerwise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method. Collectively, our results suggest: (1) the GN approximation is highly effective for preconditioning, implying that higher-order loss terms may not be critical for convergence speed; (2) the layerwise Hessian structure contains sufficient information to achieve most of these potential gains; and (3) a significant performance gap exists between current approximate methods and an idealized layerwise oracle.
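For concreteness, a minimal sketch of the Gauss-Newton preconditioned update and its layerwise (block-diagonal) restriction is given below; the notation (damping \(\lambda\), per-layer blocks \(G_t^{(l)}\)) is ours and generic, not necessarily the paper's exact formulation.

\[
G_t = J_t^\top H_{\mathcal{L}}\, J_t, \qquad
\theta_{t+1} = \theta_t - \eta\,\bigl(G_t + \lambda I\bigr)^{-1} \nabla_\theta \mathcal{L}(\theta_t),
\]

where \(J_t\) is the Jacobian of the model outputs with respect to the parameters and \(H_{\mathcal{L}}\) is the Hessian of the loss with respect to the outputs. The layerwise variant replaces \(G_t\) with \(\operatorname{blockdiag}\bigl(G_t^{(1)}, \dots, G_t^{(L)}\bigr)\), keeping only the per-layer curvature blocks and discarding cross-layer terms.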