Despite the powerful representation learning capabilities of deep neural networks, a theoretical understanding of how networks can simultaneously achieve meaningful feature learning and global convergence remains elusive. Existing approaches such as the neural tangent kernel (NTK) are limited because, under that parametrization, features stay close to their initialization, leaving open the question of how features behave when they evolve substantially during training. In this paper, we investigate the training dynamics of infinitely wide, $L$-layer neural networks using the tensor program (TP) framework. Specifically, we show that, under the Maximal Update parametrization ($\mu$P) and mild conditions on the activation function, stochastic gradient descent (SGD) enables these networks to learn linearly independent features that deviate substantially from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum. Our analysis leverages both the interactions among features across layers and the properties of Gaussian random variables, providing new insights into deep representation learning. We further validate our theoretical findings through experiments on real-world datasets.
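To fix notation, the following is a minimal sketch of the setting described above, stated in the abc-parametrization form common in the tensor-program literature for a width-$n$ network with $L$ hidden layers; the constants below are one standard presentation of $\mu$P (up to the usual rescaling symmetry) and may differ from the paper's exact setup:
\begin{align*}
  &h^1(x) = W^1 x, \qquad h^l(x) = W^l\,\phi\bigl(h^{l-1}(x)\bigr)\ \ (2 \le l \le L), \qquad
    f(x) = W^{L+1}\,\phi\bigl(h^L(x)\bigr),\\
  &W^l = n^{-a_l} w^l, \qquad w^l_{ij} \sim \mathcal{N}\bigl(0,\ n^{-2 b_l}\bigr) \ \text{at initialization}, \qquad
    \text{SGD step size } \eta\, n^{-c},\\
  &\mu\text{P:}\quad a_1 = -\tfrac12,\quad a_l = 0\ \ (2 \le l \le L),\quad a_{L+1} = \tfrac12,\quad
    b_l = \tfrac12\ \ \text{for all } l,\quad c = 0.
\end{align*}
Under this choice every layer's features move by $\Theta(1)$ as $n \to \infty$, in contrast to NTK scaling, under which features remain $O(n^{-1/2})$-close to their initialization.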