Neural Networks Efficiently Learn Low-Dimensional Representations with SGD

We study the problem of training a two-layer neural network (NN) of arbitrary width using stochastic gradient descent (SGD) where the input $\boldsymbol{x}\in \mathbb{R}^d$ is Gaussian and the target $y \in \mathbb{R}$ follows a multiple-index model, i.e., $y=g(\langle\boldsymbol{u_1},\boldsymbol{x}\rangle,\ldots,\langle\boldsymbol{u_k},\boldsymbol{x}\rangle)$ with a noisy link function $g$. We prove that the first-layer weights of the NN converge to the $k$-dimensional principal subspace spanned by the vectors $\boldsymbol{u_1},\ldots,\boldsymbol{u_k}$ of the true model when online SGD with weight decay is used for training. This phenomenon has several important consequences when $k \ll d$. First, by employing uniform convergence on this smaller subspace, we establish a generalization error bound of $O(\sqrt{kd/T})$ after $T$ iterations of SGD, which is independent of the width of the NN. We further demonstrate that SGD-trained ReLU NNs can learn a single-index target of the form $y=f(\langle\boldsymbol{u},\boldsymbol{x}\rangle) + \epsilon$ by recovering the principal direction, with a sample complexity linear in $d$ (up to log factors), where $f$ is a monotonic function with at most polynomial growth and $\epsilon$ is the noise. This is in contrast to the known $d^{\Omega(p)}$ sample requirement to learn any degree-$p$ polynomial in the kernel regime, and it shows that NNs trained with SGD can outperform the neural tangent kernel at initialization. Finally, we also provide compressibility guarantees for NNs using the approximate low-rank structure produced by SGD.
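
As an illustration of the single-index setting described above, the following is a minimal NumPy sketch of online SGD with weight decay on a two-layer ReLU network. It is a toy example under stated assumptions, not the authors' experimental setup: the link function `tanh`, the hyperparameter values, the choice to train only the first layer (second layer `a` and biases `b` held fixed), and the names `run_online_sgd` and `subspace_alignment` are all hypothetical choices made for illustration. The alignment score is the fraction of the squared Frobenius norm of the first-layer weights that lies in the span of the true direction(s).

```python
import numpy as np


def subspace_alignment(W, U):
    """Fraction of ||W||_F^2 that lies in the column span of U (value in [0, 1])."""
    Q, _ = np.linalg.qr(U)  # orthonormal basis of span(U), shape (d, k)
    return float(np.linalg.norm(W @ Q) ** 2 / np.linalg.norm(W) ** 2)


def run_online_sgd(d=20, m=16, T=30000, lr=0.01, weight_decay=0.05,
                   noise_std=0.1, seed=0):
    """Online SGD on the squared loss; returns alignment before and after training.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)

    # True direction u (unit norm) of the single-index target.
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    U = u[:, None]

    # Two-layer ReLU network: x -> a . relu(W x + b).
    W = rng.standard_normal((m, d)) / np.sqrt(d)      # trained first layer
    b = rng.standard_normal(m)                        # fixed biases (assumption)
    a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)  # fixed second layer (assumption)

    before = subspace_alignment(W, U)
    for _ in range(T):
        # Online SGD: one fresh Gaussian sample per iteration.
        x = rng.standard_normal(d)
        # Monotone link f = tanh (assumption) plus additive noise.
        y = np.tanh(u @ x) + noise_std * rng.standard_normal()

        pre = W @ x + b
        residual = a @ np.maximum(pre, 0.0) - y

        # Gradient of 0.5 * residual^2 with respect to W, then weight decay.
        grad_W = np.outer(residual * a * (pre > 0), x)
        W -= lr * (grad_W + weight_decay * W)

    return before, subspace_alignment(W, U)


before, after = run_online_sgd()
print(f"alignment at initialization: {before:.3f}")
print(f"alignment after training:    {after:.3f}")
```

At a random initialization the alignment is roughly $1/d$; if the first-layer weights concentrate on the principal direction as the result above predicts, the second printed value should be noticeably larger. For a multiple-index target, passing a $d \times k$ matrix whose columns are $\boldsymbol{u_1},\ldots,\boldsymbol{u_k}$ as `U` gives the analogous score for the $k$-dimensional subspace.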
