We analyze the dynamics of finite-width effects in wide but finite feature-learning neural networks. Starting from a dynamical mean field theory (DMFT) description of infinite-width deep neural network kernel and prediction dynamics, we characterize the $\mathcal{O}(1/\sqrt{\text{width}})$ fluctuations of the DMFT order parameters over random initializations of the network weights. Our results are perturbative in width but, unlike prior analyses, non-perturbative in the strength of feature learning. In the lazy limit of network training, all kernels are random but static in time, and the prediction variance has a universal form. In the rich, feature-learning regime, however, the fluctuations of the kernels and predictions are dynamically coupled, with a variance that can be computed self-consistently. In two-layer networks, we show how feature learning can dynamically reduce the variance of the final tangent kernel and final network predictions. We also show how initialization variance can slow down online learning in wide but finite networks. In deeper networks, kernel variance can accumulate dramatically through subsequent layers at large feature learning strengths, but feature learning continues to improve the signal-to-noise ratio of the feature kernels. In discrete time, we demonstrate that large-learning-rate phenomena such as edge-of-stability effects can be well captured by infinite-width dynamics and that initialization variance can decrease dynamically. For CNNs trained on CIFAR-10, we empirically find significant corrections to both the bias and variance of network dynamics due to finite width.
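As a rough numerical illustration of the $\mathcal{O}(1/\sqrt{\text{width}})$ scaling of kernel fluctuations over random initializations, the sketch below estimates the standard deviation of a feature kernel at initialization for several widths. It is not the DMFT calculation and involves no training; it only checks the initialization-time scaling. The tanh nonlinearity, the standard Gaussian weights, the two input vectors, and the helper names `feature_kernel` and `kernel_std` are all assumptions made for this example.

```python
import math
import random
import statistics


def feature_kernel(width, x1, x2, rng):
    """Phi(x1, x2) = (1/width) * sum_i tanh(w_i . x1) * tanh(w_i . x2).

    Assumption: each hidden weight vector w_i has i.i.d. N(0, 1) entries.
    """
    total = 0.0
    for _ in range(width):
        w = [rng.gauss(0.0, 1.0) for _ in x1]
        h1 = sum(wi * a for wi, a in zip(w, x1))  # preactivation on x1
        h2 = sum(wi * b for wi, b in zip(w, x2))  # preactivation on x2
        total += math.tanh(h1) * math.tanh(h2)
    return total / width


def kernel_std(width, x1, x2, n_seeds=200, seed=0):
    """Sample standard deviation of the kernel over n_seeds random initializations."""
    rng = random.Random(seed)
    samples = [feature_kernel(width, x1, x2, rng) for _ in range(n_seeds)]
    return statistics.stdev(samples)


# Hypothetical inputs, chosen only for illustration.
x1 = [1.0, 0.5, -0.3]
x2 = [0.2, -1.0, 0.7]

widths = [64, 256, 1024]
stds = {n: kernel_std(n, x1, x2) for n in widths}

for n in widths:
    # If the fluctuations scale as 1/sqrt(width), the last column is roughly constant.
    print(n, stds[n], stds[n] * math.sqrt(n))
```

Because the kernel is an average of `width` independent terms at initialization, its standard deviation should shrink by roughly a factor of two each time the width is quadrupled. The dynamical coupling between kernel and prediction fluctuations in the feature-learning regime is not captured by this static check.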