Despite the non-convex optimization landscape, over-parametrized shallow networks are able to achieve global convergence under gradient descent. The picture can be radically different for narrow networks, which tend to get stuck in poorly generalizing local minima. Here we investigate the crossover between these two regimes in the high-dimensional setting, and in particular the connection between the so-called mean-field/hydrodynamic regime and the seminal approach of Saad & Solla. Focusing on the case of Gaussian data, we study the interplay between the learning rate, the time scale, and the number of hidden units in the high-dimensional dynamics of stochastic gradient descent (SGD). Our work builds on a deterministic description of SGD in high dimensions from statistical physics, which we extend and for which we provide rigorous convergence rates.