We study the evolution of hidden-weight spectra in wide neural networks trained by (stochastic) gradient descent. We develop a two-level dynamical mean-field theory (DMFT) that jointly tracks bulk and outlier spectral dynamics for spiked ensembles whose spike directions remain statistically dependent on the random bulk. We apply this framework to two settings: (1) infinite-width nonlinear networks in mean-field/$\mu$P scaling and (2) deep linear networks in the proportional high-dimensional limit, where width, input dimension, and sample size diverge with fixed ratios. Our theory predicts how outliers evolve with training time, width, output scale, and initialization variance. In deep linear networks, $\mu$P yields width-consistent outlier dynamics and hyperparameter transfer, including width-stable growth of the leading NTK mode toward the edge of stability (EoS). In contrast, NTK parameterization exhibits strongly width-dependent outlier dynamics, even though it converges to a stable large-width limit. We show that this bulk-plus-outlier picture describes simple tasks with few output channels, but that tasks with many outputs (ImageNet classification or GPT language modeling) are better described by a restructuring of the spectral bulk. We develop a toy model with an extensive number of output channels that recapitulates this phenomenon, and we show that the edge of the spectrum still converges for sufficiently wide networks.