State-of-the-art neural networks require enormous computational resources to train. It is therefore natural to ask whether they are trained optimally. Here we apply a recent advance in stochastic thermodynamics that bounds the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network, in terms of the ratio of their Wasserstein-2 distance to the entropy production rate of the dynamical process connecting them. Considering both gradient-flow and Langevin training dynamics, we provide analytical expressions for these speed limits for linear and linearizable neural networks, e.g., networks in the Neural Tangent Kernel (NTK) regime. Remarkably, given some plausible scaling assumptions on the NTK spectra and the spectral decomposition of the labels, learning is optimal in a scaling sense. Our results are consistent with small-scale experiments with Convolutional Neural Networks (CNNs) and Fully Connected Neural Networks (FCNs) on CIFAR-10, which show a short, highly non-optimal regime followed by a longer optimal regime.
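As a rough illustration of the kind of bound described above, the following minimal sketch evaluates a thermodynamic speed limit in its commonly quoted overdamped-Langevin form, τ ≥ W₂² / (μ T Σ), where Σ is the total entropy production (with k_B = 1). This form, the diagonal-Gaussian weight distributions, and the names `w2_squared_diag_gaussians`, `min_training_time`, `mobility`, and `temperature` are assumptions made for illustration; the paper's own expressions and normalization may differ.

```python
import numpy as np


def w2_squared_diag_gaussians(mean0, var0, mean1, var1):
    """Squared Wasserstein-2 distance between two Gaussians with diagonal covariances.

    For commuting covariances the closed form is
    |m0 - m1|^2 + sum_i (sqrt(v0_i) - sqrt(v1_i))^2.
    """
    mean0, var0, mean1, var1 = (
        np.asarray(a, dtype=float) for a in (mean0, var0, mean1, var1)
    )
    mean_term = np.sum((mean0 - mean1) ** 2)
    spread_term = np.sum((np.sqrt(var0) - np.sqrt(var1)) ** 2)
    return float(mean_term + spread_term)


def min_training_time(w2_squared, entropy_production, mobility=1.0, temperature=1.0):
    """Lower bound on the transition time, assuming tau >= W2^2 / (mu * T * Sigma).

    `entropy_production` is the total entropy produced along the process (k_B = 1).
    The overdamped-Langevin form of the bound is an assumption of this sketch.
    """
    return w2_squared / (mobility * temperature * entropy_production)


# Hypothetical example: weights start at N(0, I) and end at N((3, 4), I).
w2_sq = w2_squared_diag_gaussians([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])
tau_min = min_training_time(w2_sq, entropy_production=5.0)
```

Comparing such a lower bound with the actual training time gives one way to quantify how close a given training run is to the limit, under the assumptions stated above.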