This study explores the sample complexity for two-layer neural networks to learn a generalized linear target function under Stochastic Gradient Descent (SGD), focusing on the challenging regime where many flat directions are present at initialization. It is well established that in this scenario $n=O(d \log d)$ samples are typically needed. We go further and provide precise results for the prefactors, both in the high-dimensional limit and as a function of the network width. Notably, our findings suggest that, within this problem class, overparameterization can improve convergence only by a constant factor. These insights rest on reducing the SGD dynamics to a stochastic process in lower dimensions, where escaping mediocrity amounts to computing an exit time. We nevertheless show that a deterministic approximation of this process adequately captures the escape time, implying that the role of stochasticity may be minimal in this scenario.
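The exit-time picture can be illustrated with a minimal sketch. The code below is a toy model, not the reduced process derived in this work: it assumes a scalar overlap `m` that starts at the typical random-initialization scale $1/\sqrt{d}$, grows with a linear drift of size `gamma / d` per SGD step, and receives additive Gaussian noise of the same order. The names `gamma`, `threshold`, and `noise`, and the linear form of the drift, are all illustrative assumptions. Under these assumptions the noiseless dynamics exit after $(d/\gamma)\log(\text{threshold}\cdot\sqrt{d})$ steps, which grows like $d \log d$, and the noisy run can be compared against that deterministic prediction.

```python
import math
import random


def deterministic_exit_time(d, gamma=0.5, threshold=0.5):
    """Exit time of the noiseless toy dynamics dm/dt = (gamma / d) * m.

    Starting from m0 = 1 / sqrt(d), the solution m(t) = m0 * exp(gamma * t / d)
    reaches `threshold` at t = (d / gamma) * log(threshold / m0).
    """
    m0 = 1.0 / math.sqrt(d)
    return (d / gamma) * math.log(threshold / m0)


def stochastic_exit_time(d, gamma=0.5, threshold=0.5, noise=1.0, seed=0,
                         max_steps=10**7):
    """First step at which |m| >= threshold for the noisy toy recursion.

    Each step adds a drift (gamma / d) * m and a Gaussian kick of size
    (gamma / d) * noise. This noise model is an assumption for illustration.
    Returns None if the threshold is not reached within max_steps.
    """
    rng = random.Random(seed)
    m = 1.0 / math.sqrt(d)
    for t in range(1, max_steps + 1):
        m += (gamma / d) * m + (gamma / d) * noise * rng.gauss(0.0, 1.0)
        if abs(m) >= threshold:
            return t
    return None


if __name__ == "__main__":
    for d in (100, 400, 1600):
        t_det = deterministic_exit_time(d)
        t_sto = stochastic_exit_time(d, seed=0)
        print(f"d={d:5d}  deterministic={t_det:10.1f}  stochastic={t_sto}")
```

In this toy setting the noise mostly perturbs the effective starting point, so it shifts the exit time by an amount of order $d$ while the leading $d \log d$ term comes from the deterministic drift. Whether the same holds for the actual reduced SGD process is the question addressed by the results summarized above.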