We revisit the problem of learning a single neuron with ReLU activation under Gaussian input with square loss. We focus on the over-parameterized setting, where the student network has $n\ge 2$ neurons. We prove the global convergence of randomly initialized gradient descent at an $O\left(T^{-3}\right)$ rate. This is the first global convergence result for this problem beyond the exact-parameterization setting ($n=1$), in which gradient descent enjoys an $\exp(-\Omega(T))$ rate. Perhaps surprisingly, we further present an $\Omega\left(T^{-3}\right)$ lower bound for randomly initialized gradient flow in the over-parameterized setting. Together, these two bounds give an exact characterization of the convergence rate and imply, for the first time, that over-parameterization can exponentially slow down convergence. To prove global convergence, we need to handle the interactions among student neurons in the gradient descent dynamics, which are not present in the exact-parameterization case. We analyze the dynamics of gradient descent (GD) in three phases. Along the way, we prove that gradient descent automatically balances the student neurons, and we use this property to deal with the non-smoothness of the objective function. To prove the lower bound on the convergence rate, we construct a novel potential function that characterizes the pairwise distances between the student neurons (which cannot be done in the exact-parameterization case). We show that this potential function converges slowly, which implies the slow convergence rate of the loss function.
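To make the setting concrete, here is a minimal NumPy sketch of gradient descent on the population square loss for a student with $n$ ReLU neurons and a single-ReLU teacher. It is an illustration under stated assumptions, not the authors' code: it uses the standard closed form $\mathbb{E}[\mathrm{ReLU}(w^\top x)\,\mathrm{ReLU}(u^\top x)] = \frac{\|w\|\|u\|}{2\pi}\left(\sin\theta + (\pi-\theta)\cos\theta\right)$ for $x\sim\mathcal{N}(0,I)$, where $\theta$ is the angle between $w$ and $u$. The function names, the teacher vector $v=e_1$, the dimension, the step size, and the initialization scale are all hypothetical choices made for this example.

```python
import numpy as np


def _angle(w, u):
    """Angle between w and u, with the cosine clipped for numerical safety."""
    c = (w @ u) / (np.linalg.norm(w) * np.linalg.norm(u))
    return np.arccos(np.clip(c, -1.0, 1.0))


def relu_kernel(w, u):
    """Closed form of E[ReLU(w.x) ReLU(u.x)] for x ~ N(0, I)."""
    t = _angle(w, u)
    scale = np.linalg.norm(w) * np.linalg.norm(u)
    return scale * (np.sin(t) + (np.pi - t) * np.cos(t)) / (2 * np.pi)


def relu_kernel_grad(w, u):
    """Gradient of relu_kernel(w, u) with respect to w, holding u fixed."""
    t = _angle(w, u)
    w_hat = w / np.linalg.norm(w)
    return (np.linalg.norm(u) * np.sin(t) * w_hat + (np.pi - t) * u) / (2 * np.pi)


def population_loss(W, v):
    """0.5 * E[(sum_i ReLU(w_i.x) - ReLU(v.x))^2]; rows of W are student neurons."""
    student_student = sum(relu_kernel(a, b) for a in W for b in W)
    student_teacher = sum(relu_kernel(a, v) for a in W)
    return 0.5 * (student_student - 2.0 * student_teacher + relu_kernel(v, v))


def population_grad(W, v):
    """Gradient of population_loss with respect to each row of W."""
    return np.stack(
        [sum(relu_kernel_grad(w, u) for u in W) - relu_kernel_grad(w, v) for w in W]
    )


def run_gd(n, d=5, steps=2000, lr=0.1, init_scale=0.1, seed=0):
    """Run GD from a small Gaussian initialization; return the loss per step.

    All hyperparameters here are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    v = np.zeros(d)
    v[0] = 1.0  # assumed unit-norm teacher neuron
    W = init_scale * rng.standard_normal((n, d))
    losses = []
    for _ in range(steps):
        losses.append(population_loss(W, v))
        W = W - lr * population_grad(W, v)
    return np.array(losses)


if __name__ == "__main__":
    for n in (1, 2):
        losses = run_gd(n)
        print(f"n={n}: initial loss {losses[0]:.3e}, final loss {losses[-1]:.3e}")
```

Plotting the returned losses against the step index on a log-log scale for `n=1` and `n=2` is one way to compare the two regimes empirically. Such a plot only illustrates the behavior for one random seed and one set of hyperparameters; it does not substitute for the bounds stated above.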