Understanding the curvature evolution of the loss landscape is fundamental to analyzing the training dynamics of neural networks. The most commonly studied measure, Hessian sharpness ($\lambda_{\max}^{H}$), the largest eigenvalue of the loss Hessian, determines local training stability and interacts with the learning rate throughout training. Despite its significance, direct measurement of Hessian sharpness remains prohibitively expensive for Large Language Models (LLMs). We analyze $\textit{critical sharpness}$ ($\lambda_c$), a computationally efficient measure requiring fewer than $10$ forward passes given the update direction $\Delta\boldsymbol{\theta}$. Critically, this measure captures well-documented Hessian sharpness phenomena, including progressive sharpening and Edge of Stability. Using this measure, we provide the first demonstration of these sharpness phenomena at scale, up to $7$B parameters, spanning both pre-training and mid-training of OLMo-2 models. We further introduce $\textit{relative critical sharpness}$ ($\lambda_c^{1\to 2}$), which quantifies the curvature of one loss landscape while optimizing another, to analyze the transition from pre-training to fine-tuning and to guide data-mixing strategies. Critical sharpness gives practitioners a practical tool for diagnosing curvature dynamics and informing data-composition choices at scale. More broadly, our work shows that scalable curvature measures can provide actionable insights for large-scale training.
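The abstract does not spell out how $\lambda_c$ is computed, but a directional curvature along a given update direction can be estimated with a handful of forward passes and no backward passes, which is consistent with the "fewer than $10$ forward passes" claim. The sketch below, with illustrative names and a toy quadratic loss (not the paper's actual procedure), uses a central finite difference to estimate the curvature of the loss along a direction $\Delta\boldsymbol{\theta}$:

```python
import numpy as np

def directional_curvature(loss_fn, theta, delta, eps=1e-3):
    """Estimate the second derivative of the loss along the unit
    direction delta/||delta|| via a central finite difference.
    Costs 3 forward passes; no gradients or Hessian products needed."""
    d = delta / np.linalg.norm(delta)          # normalize the direction
    l_plus = loss_fn(theta + eps * d)          # forward pass 1
    l_zero = loss_fn(theta)                    # forward pass 2
    l_minus = loss_fn(theta - eps * d)         # forward pass 3
    return (l_plus - 2.0 * l_zero + l_minus) / eps**2

# Toy quadratic loss L(theta) = 0.5 * theta^T H theta, so the
# curvature along direction d is exactly d^T H d.
H = np.diag([4.0, 1.0])
loss = lambda th: 0.5 * th @ H @ th
theta = np.array([1.0, 1.0])
delta = np.array([1.0, 0.0])   # aligned with the largest eigenvalue of H

print(directional_curvature(loss, theta, delta))  # ≈ 4.0
```

For a quadratic loss the finite difference is exact up to floating-point error, so the estimate recovers the eigenvalue $4$ along the chosen direction; for neural-network losses the step size `eps` trades off truncation error against numerical noise.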