Accurate Hessian spectra of foundation models have remained out of reach, leading most prior work to rely on small models or strong structural approximations. We show that faithful spectral analysis of the true Hessian is tractable at frontier scale. Using shard-local finite-difference Hessian-vector products compatible with Fully Sharded Data Parallel (FSDP) training, we perform stochastic Lanczos quadrature on open-source language models with up to 100B parameters, producing the first large-scale spectral density estimates beyond the sub-10B regime. We characterize the numerical behavior of this pipeline, including finite-difference bias, floating-point noise amplification, and their effect on Krylov stability in fp32 and bf16, and derive practical operating regimes that we validate empirically. We further provide end-to-end runtime and memory scaling laws, showing that full-operator spectral probing incurs only a modest constant-factor overhead over first-order training. Crucially, direct access to the Hessian reveals that widely used block-diagonal curvature approximations can fail catastrophically, exhibiting order-one relative error and poor directional alignment even in mid-scale LLMs. Together, our results demonstrate that foundation-model Hessian spectra are both computable and qualitatively misrepresented by prevailing approximations, opening the door to principled curvature-based analysis at scale.
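To make the two ingredients named in the abstract concrete, the following is a minimal single-process sketch of a finite-difference Hessian-vector product feeding stochastic Lanczos quadrature. It is not the authors' implementation: the function names (`fd_hvp`, `lanczos`, `slq`), the choice of a central difference, the step size `eps`, the use of full reorthogonalization, and the toy quadratic loss are all assumptions made for illustration, and nothing here is sharded or FSDP-aware.

```python
import numpy as np


def fd_hvp(grad_fn, theta, v, eps=1e-3):
    """Central finite-difference Hessian-vector product (assumed variant).

    Approximates H v by (g(theta + eps v) - g(theta - eps v)) / (2 eps),
    which needs only two gradient evaluations and no second-order autodiff.
    """
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2.0 * eps)


def lanczos(matvec, v0, m):
    """Run up to m Lanczos steps and return the tridiagonal matrix T.

    Full reorthogonalization is used here for simplicity; it is an
    illustrative choice, not something stated in the abstract.
    """
    n = v0.size
    Q = np.zeros((m, n))
    alphas, betas = [], []
    q = v0 / np.linalg.norm(v0)
    for j in range(m):
        Q[j] = q
        w = matvec(q)
        alphas.append(q @ w)
        # Remove components along all previous Lanczos vectors.
        w = w - Q[: j + 1].T @ (Q[: j + 1] @ w)
        b = np.linalg.norm(w)
        if j == m - 1 or b < 1e-10:
            break
        betas.append(b)
        q = w / b
    return np.diag(alphas) + np.diag(betas, 1) + np.diag(betas, -1)


def slq(matvec, n, m=30, probes=8, seed=0):
    """Stochastic Lanczos quadrature: nodes and weights of a spectral density.

    Each Rademacher probe contributes the eigenvalues of T as nodes and the
    squared first components of its eigenvectors as weights.
    """
    rng = np.random.default_rng(seed)
    nodes, weights = [], []
    for _ in range(probes):
        v = rng.choice([-1.0, 1.0], size=n)
        T = lanczos(matvec, v, m)
        evals, evecs = np.linalg.eigh(T)
        nodes.append(evals)
        weights.append(evecs[0] ** 2 / probes)
    return np.concatenate(nodes), np.concatenate(weights)


# Toy problem (hypothetical): loss(theta) = 0.5 * theta^T A theta, so H = A.
rng = np.random.default_rng(0)
B = rng.standard_normal((50, 50))
A = B @ B.T / 50.0
theta0 = rng.standard_normal(50)


def toy_grad(theta):
    return A @ theta


nodes, weights = slq(lambda u: fd_hvp(toy_grad, theta0, u), n=50, m=20, probes=4)
print("estimated spectral mean:", float(nodes @ weights))
print("true spectral mean:     ", float(np.trace(A) / 50))
```

The weighted nodes can be smoothed (for example with a Gaussian kernel) to plot a density. On a real model, `grad_fn` would be a full forward and backward pass, and the trade-off between finite-difference bias and floating-point noise discussed in the abstract would govern the choice of `eps`.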