We analyze the dynamics of finite-width effects in wide but finite feature-learning neural networks. Unlike many prior analyses, our results, while perturbative in width, are non-perturbative in the strength of feature learning. Starting from a dynamical mean field theory (DMFT) description of the kernel and prediction dynamics of infinite-width deep neural networks, we characterize the $\mathcal{O}(1/\sqrt{\text{width}})$ fluctuations of the DMFT order parameters over random initializations of the network weights. In the lazy limit of network training, all kernels are random but static in time, and the prediction variance has a universal form. In the rich, feature-learning regime, however, the fluctuations of the kernels and predictions are dynamically coupled, with a variance that can be computed self-consistently. In two-layer networks, we show how feature learning can dynamically reduce the variance of the final neural tangent kernel (NTK) and of the final network predictions. We also show how initialization variance can slow down online learning in wide but finite networks. In deeper networks, kernel variance can accumulate dramatically through subsequent layers at large feature-learning strengths, but feature learning continues to improve the signal-to-noise ratio (SNR) of the feature kernels. In discrete time, we demonstrate that large-learning-rate phenomena such as edge-of-stability effects can be well captured by infinite-width dynamics, and that initialization variance can decrease dynamically. For CNNs trained on CIFAR-10, we empirically find significant finite-width corrections to both the bias and the variance of network dynamics.
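The $\mathcal{O}(1/\sqrt{\text{width}})$ fluctuation scale quoted above can be checked numerically at initialization, where kernels are random but static. The following is a minimal sketch and not the DMFT computation itself: it assumes a two-layer tanh network in NTK parameterization, $f(x) = \frac{1}{\sqrt{N}} \sum_i a_i \tanh(w_i \cdot x / \sqrt{D})$ with standard Gaussian weights, and estimates the variance of one NTK entry over random initializations. The function names, the input dimension, and the choice of inputs are illustrative assumptions. Under these assumptions the kernel entry is an average of $N$ independent terms, so its variance should scale as $1/N$ and `width * variance` should be roughly constant.

```python
import numpy as np


def ntk_entry_at_init(x1, x2, width, rng):
    """Empirical NTK entry K(x1, x2) of a two-layer tanh network at initialization.

    Assumed (illustrative) parameterization:
        f(x) = a . tanh(W x / sqrt(D)) / sqrt(width),
    with all entries of W and a drawn i.i.d. from N(0, 1).
    """
    d = x1.shape[0]
    W = rng.standard_normal((width, d))
    a = rng.standard_normal(width)
    h1 = W @ x1 / np.sqrt(d)
    h2 = W @ x2 / np.sqrt(d)
    # Gradients w.r.t. the readout weights a contribute phi(h1) * phi(h2).
    readout_term = np.tanh(h1) * np.tanh(h2)
    # Gradients w.r.t. the hidden weights W contribute
    # a^2 * phi'(h1) * phi'(h2) * (x1 . x2) / D.
    dphi1 = 1.0 - np.tanh(h1) ** 2
    dphi2 = 1.0 - np.tanh(h2) ** 2
    hidden_term = a ** 2 * dphi1 * dphi2 * (x1 @ x2) / d
    # The 1/width prefactor of the NTK turns the sum over neurons into a mean.
    return np.mean(readout_term + hidden_term)


def ntk_variance(width, n_seeds=2000, seed=0, d=10):
    """Variance of one NTK entry over n_seeds random initializations."""
    # Fixed, arbitrary inputs (an assumption made only for this illustration).
    data_rng = np.random.default_rng(123)
    x1, x2 = data_rng.standard_normal((2, d))
    rng = np.random.default_rng(seed)
    samples = [ntk_entry_at_init(x1, x2, width, rng) for _ in range(n_seeds)]
    return np.var(samples)


if __name__ == "__main__":
    for width in (64, 256, 1024):
        var = ntk_variance(width)
        # width * var should stay roughly constant if Var[K] scales as 1/width.
        print(f"width={width:5d}  var={var:.3e}  width*var={width * var:.3f}")
```

This sketch only probes the static, lazy-limit picture. Capturing how feature learning changes the variance during training would require actually training the networks (or solving the self-consistent equations), which is beyond this illustration.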