In this paper, we study the loss landscape of one-hidden-layer neural networks with ReLU-like activation functions trained on the empirical squared loss using gradient descent (GD). We identify the stationary points of such networks, which significantly slow down the loss decrease during training. To capture these points while accounting for the non-differentiability of the loss, the stationary points we study are directional stationary points, rather than weaker notions such as Clarke stationary points. We show that if a stationary point does not contain "escape neurons", which are defined via first-order conditions, then it must be a local minimum. Moreover, in the scalar-output case, the presence of an escape neuron guarantees that the stationary point is not a local minimum. Our results refine the description of the saddle-to-saddle training dynamics starting from infinitesimally small (vanishing) initialization for shallow ReLU-like networks: by precluding saddle-escape types that previous works did not rule out, we move one step closer to a complete picture of the entire dynamics. Moreover, we fully characterize how network embedding, i.e., instantiating a narrower network within a wider one, reshapes the stationary points.
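For context, a minimal sketch of the stationarity notion involved, stated with the standard definition from nonsmooth analysis (the symbol $L$ for the empirical loss is our notation here, not quoted from the paper): a point $\theta^*$ is a directional stationary point of $L$ if the one-sided directional derivative is nonnegative in every direction,
\[
L'(\theta^*; d) \;:=\; \lim_{t \downarrow 0} \frac{L(\theta^* + t d) - L(\theta^*)}{t} \;\ge\; 0
\qquad \text{for all directions } d,
\]
whereas Clarke stationarity only requires $0 \in \partial_C L(\theta^*)$, where $\partial_C$ denotes the Clarke subdifferential. For locally Lipschitz, nonsmooth losses such as those of ReLU-like networks, directional stationarity is the stronger condition: every directional stationary point is Clarke stationary, but not conversely.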