According to a popular viewpoint, neural networks learn from data by first identifying low-dimensional representations, and subsequently fitting the best model in this space. Recent works provide a formalization of this phenomenon when learning multi-index models. In this setting, we are given $n$ i.i.d. pairs $({\boldsymbol x}_i,y_i)$, where the covariate vectors ${\boldsymbol x}_i\in\mathbb{R}^d$ are isotropic, and the responses $y_i$ depend on ${\boldsymbol x}_i$ only through a $k$-dimensional projection ${\boldsymbol \Theta}_*^{\sf T}{\boldsymbol x}_i$. Feature learning amounts to learning the latent space spanned by ${\boldsymbol \Theta}_*$. In this context, we study the gradient descent dynamics of two-layer neural networks under the proportional asymptotics $n,d\to\infty$, $n/d\to\delta$, while the dimension of the latent space $k$ and the number of hidden neurons $m$ are kept fixed. Earlier work establishes that feature learning via polynomial-time algorithms is possible if $\delta > \delta_{\text{alg}}$, for $\delta_{\text{alg}}$ a threshold depending on the data distribution, and is impossible (within a certain class of algorithms) below $\delta_{\text{alg}}$. Here we derive an analogous threshold $\delta_{\text{NN}}$ for two-layer networks. Our characterization of $\delta_{\text{NN}}$ opens the way to studying the dependence of learning dynamics on the network architecture and training algorithm. The threshold $\delta_{\text{NN}}$ is determined by the following scenario. Training first visits points at which the gradient of the empirical risk is large and learns the directions spanned by these gradients. Then the gradient becomes smaller and the dynamics becomes dominated by directions of negative curvature of the Hessian. The threshold $\delta_{\text{NN}}$ corresponds to a phase transition in the spectrum of the Hessian in this second phase.
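To make the setup concrete, the sketch below simulates one instance of the model described above: isotropic covariates in $\mathbb{R}^d$ with $n = \delta d$ samples, responses generated through a low-dimensional projection, and a two-layer network of $m$ neurons trained by full-batch gradient descent. The single-index choice $k=1$, the tanh link, the ReLU activation, the square loss, and the step size are all illustrative assumptions rather than specifications from the paper; feature learning is monitored through the overlap of the first-layer weights with ${\boldsymbol \Theta}_*$.

```python
# Minimal sketch (not the paper's exact protocol): a single-index example (k = 1)
# in the proportional regime n = delta * d, with a two-layer ReLU network of m
# neurons trained by full-batch gradient descent on the square loss. The link
# function, step size, and alignment metric are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, delta, m = 400, 4.0, 8
n = int(delta * d)

# Ground-truth latent direction Theta_* (k = 1) and data from the multi-index model.
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)
X = rng.standard_normal((n, d))                  # isotropic covariates
y = np.tanh(X @ theta_star)                      # assumed link: y = tanh(<theta_*, x>)

# Two-layer network f(x) = sum_j a_j * relu(<w_j, x>).
W = rng.standard_normal((m, d)) / np.sqrt(d)     # first-layer weights
a = rng.standard_normal(m) / np.sqrt(m)          # second-layer weights
relu = lambda z: np.maximum(z, 0.0)

lr, steps = 0.05, 2000
for t in range(steps):
    Z = X @ W.T                                  # (n, m) pre-activations
    H = relu(Z)
    pred = H @ a
    resid = pred - y                             # gradient of 0.5 * mean squared error
    grad_a = H.T @ resid / n
    grad_W = ((resid[:, None] * (Z > 0) * a).T @ X) / n
    a -= lr * grad_a
    W -= lr * grad_W

# Feature learning = alignment of the learned weights with span(Theta_*),
# measured here by the largest overlap |<w_j / ||w_j||, theta_*>| across neurons.
overlaps = np.abs(W @ theta_star) / np.linalg.norm(W, axis=1)
print(f"delta = {delta}: max overlap with theta_* = {overlaps.max():.3f}")
```

Sweeping $\delta$ in a simulation of this kind is one numerical way to probe where alignment with the latent direction begins to emerge, which is the regime the threshold $\delta_{\text{NN}}$ characterizes analytically.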