We revisit the problem of learning a single neuron with ReLU activation under Gaussian input with square loss. We focus on the over-parameterized setting, where the student network has $n\ge 2$ neurons. We prove the global convergence of randomly initialized gradient descent with an $O\left(T^{-3}\right)$ rate. This is the first global convergence result for this problem beyond the exact-parameterization setting ($n=1$), in which gradient descent enjoys an $\exp(-\Omega(T))$ rate. Perhaps surprisingly, we further present an $\Omega\left(T^{-3}\right)$ lower bound for randomly initialized gradient flow in the over-parameterized setting. Together, these two bounds give an exact characterization of the convergence rate and imply, for the first time, that over-parameterization can exponentially slow down convergence. To prove global convergence, we need to handle the interactions among student neurons in the gradient descent dynamics, which are not present in the exact-parameterization case. We analyze the dynamics of gradient descent using a three-phase structure. Along the way, we prove that gradient descent automatically balances the student neurons, and we use this property to deal with the non-smoothness of the objective function. To prove the lower bound on the convergence rate, we construct a novel potential function that characterizes the pairwise distances between the student neurons (which cannot be done in the exact-parameterization case). We show that this potential function converges slowly, which implies the slow convergence of the loss function.
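To make the setting concrete, the following is a minimal sketch of the problem, not the analysis or code behind the results above. It assumes (the abstract does not say) that the student is a plain sum of $n$ ReLU neurons without output weights, that the teacher is a unit vector $v$, and that the loss is $\frac{1}{2}\mathbb{E}_{x\sim\mathcal{N}(0,I)}\big[(\sum_i \mathrm{ReLU}(w_i^\top x)-\mathrm{ReLU}(v^\top x))^2\big]$. The expectation is evaluated with the standard closed form for Gaussian ReLU correlations, so no sampling is needed. All function names, the step size, and the initialization scale are illustrative choices.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def angle(w, u):
    # Angle between w and u; the cosine is clamped to guard against rounding.
    c = dot(w, u) / (norm(w) * norm(u))
    return math.acos(max(-1.0, min(1.0, c)))

def relu_kernel(w, u):
    # Closed form of E[relu(w.x) * relu(u.x)] for x ~ N(0, I).
    t = angle(w, u)
    return norm(w) * norm(u) * (math.sin(t) + (math.pi - t) * math.cos(t)) / (2 * math.pi)

def relu_kernel_grad(w, u):
    # Closed form of E[1{w.x > 0} * relu(u.x) * x] for x ~ N(0, I).
    t = angle(w, u)
    nw, nu = norm(w), norm(u)
    return [(nu * math.sin(t) * wi / nw + (math.pi - t) * ui) / (2 * math.pi)
            for wi, ui in zip(w, u)]

def loss(W, v):
    # Population square loss of the assumed student (sum of ReLU neurons).
    ss = sum(relu_kernel(a, b) for a in W for b in W)
    sv = sum(relu_kernel(a, v) for a in W)
    return 0.5 * (ss - 2.0 * sv + relu_kernel(v, v))

def grad(W, v):
    # Gradient of the loss with respect to each student neuron.
    grads = []
    for w in W:
        g = [0.0] * len(w)
        for u in W:
            g = [a + b for a, b in zip(g, relu_kernel_grad(w, u))]
        gv = relu_kernel_grad(w, v)
        grads.append([a - b for a, b in zip(g, gv)])
    return grads

def run_gd(n, d=3, steps=2000, lr=0.1, init_scale=0.1, seed=0):
    # Gradient descent from a small random Gaussian initialization
    # (hypothetical hyperparameters, not taken from the paper).
    rng = random.Random(seed)
    v = [1.0] + [0.0] * (d - 1)
    W = [[init_scale * rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    history = []
    for _ in range(steps):
        history.append(loss(W, v))
        G = grad(W, v)
        W = [[wi - lr * gi for wi, gi in zip(w, g)] for w, g in zip(W, G)]
    history.append(loss(W, v))
    return history

if __name__ == "__main__":
    for n in (1, 2):
        h = run_gd(n)
        print(f"n={n}: initial loss {h[0]:.3e}, final loss {h[-1]:.3e}")
```

Plotting `history` on a log-log scale for $n=1$ and $n\ge 2$ is one way to compare the decay of the loss by eye. Such a simulation is only suggestive: the stated rates hold under the conditions of the theorems, which this sketch does not reproduce.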