Understanding the curvature evolution of the loss landscape is fundamental to analyzing the training dynamics of neural networks. The most commonly studied measure, Hessian sharpness ($\lambda_{\max}^H$) -- the largest eigenvalue of the loss Hessian -- determines local training stability and interacts with the learning rate throughout training. Despite its significance, direct measurement of Hessian sharpness remains prohibitively expensive for Large Language Models (LLMs). We analyze $\textit{critical sharpness}$ ($\lambda_c$), a computationally efficient measure requiring fewer than $10$ forward passes given the update direction $\Delta\boldsymbol{\theta}$. Critically, this measure captures well-documented Hessian sharpness phenomena, including progressive sharpening and Edge of Stability. Using this measure, we provide the first demonstration of these sharpness phenomena at scale, up to $7$B parameters, spanning both pre-training and mid-training of OLMo-2 models. We further introduce $\textit{relative critical sharpness}$ ($\lambda_c^{1\to 2}$), which quantifies the curvature of one loss landscape while optimizing another, to analyze the transition from pre-training to fine-tuning and to guide data mixing strategies. Critical sharpness provides practitioners with a practical tool for diagnosing curvature dynamics and informing data composition choices at scale. More broadly, our work shows that scalable curvature measures can provide actionable insights for large-scale training.
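The abstract does not give the exact definition of $\lambda_c$, but the key idea -- measuring curvature along a known update direction $\Delta\boldsymbol{\theta}$ with only a few forward passes, instead of computing Hessian eigenvalues -- can be illustrated with a minimal sketch. The snippet below estimates the directional second derivative of the loss along the update direction via a central finite difference (three forward passes); the function name and the choice of finite differences are illustrative assumptions, not the paper's method.

```python
import numpy as np

def directional_sharpness(loss_fn, theta, delta, eps=1e-3):
    """Estimate the curvature of loss_fn along the update direction delta
    using a central finite difference (3 forward passes, no Hessian).
    Hypothetical sketch: the paper's exact lambda_c definition is not
    given in the abstract."""
    u = delta / np.linalg.norm(delta)          # unit update direction
    l_plus = loss_fn(theta + eps * u)
    l_zero = loss_fn(theta)
    l_minus = loss_fn(theta - eps * u)
    # Second-order central difference: d^2 L / d alpha^2 along u.
    return (l_plus - 2.0 * l_zero + l_minus) / eps**2

# Sanity check on a quadratic loss with known Hessian A:
# the curvature along a unit direction u is exactly u^T A u.
A = np.diag([4.0, 1.0])
loss = lambda th: 0.5 * th @ A @ th
theta = np.array([1.0, 1.0])
delta = np.array([1.0, 0.0])                   # along the sharp axis
print(directional_sharpness(loss, theta, delta))  # ~4.0
```

For a quadratic loss this recovers the Hessian's directional curvature exactly (up to floating-point error); for a neural network loss it gives a local curvature estimate along the actual optimizer update, which is the quantity a stability analysis against the learning rate needs.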