Neural network training is usually accomplished by solving a non-convex optimization problem using stochastic gradient descent. Although one optimizes over the network's parameters, the main loss function generally depends only on the realization of the neural network, i.e., the function it computes. Studying the optimization problem over the space of realizations opens up new ways to understand neural network training. In particular, common loss functions such as mean squared error and categorical cross entropy are convex on spaces of neural network realizations, although these spaces themselves are non-convex. Approximation capabilities of neural networks can be used to deal with the latter non-convexity, which allows us to establish that, for sufficiently large networks, local minima of a regularized optimization problem on the realization space are almost optimal. Note, however, that each realization has many different, possibly degenerate, parametrizations. In particular, a local minimum in the parametrization space need not correspond to a local minimum in the realization space. To establish such a connection, inverse stability of the realization map is required, meaning that proximity of realizations must imply proximity of corresponding parametrizations. We present pathologies which prevent inverse stability in general and, for shallow networks, proceed to establish a restricted space of parametrizations on which we have inverse stability with respect to a Sobolev norm. Furthermore, we show that by optimizing over such restricted sets, it is still possible to learn any function which can be learned by optimization over unrestricted sets.
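The statement that one realization has many parametrizations can be illustrated with a small numerical sketch. It is not taken from the work itself: the shallow ReLU network, the helper names `realization`, `rescale`, and `param_distance`, and the specific weights are all assumptions chosen for illustration. The sketch uses the positive homogeneity of ReLU (scaling a neuron's inner weights by a factor and its outer weight by the inverse factor leaves the computed function unchanged) to produce two parametrizations that are far apart yet compute the same function on a grid.

```python
import numpy as np

def realization(params, x):
    """Evaluate a shallow ReLU network x -> sum_i c_i * relu(a_i * x + b_i)."""
    a, b, c = params
    hidden = np.maximum(np.outer(x, a) + b, 0.0)  # shape (len(x), width)
    return hidden @ c

def rescale(params, lam):
    """Scale inner weights by lam > 0 and outer weights by 1 / lam.

    Because relu(lam * z) == lam * relu(z) for lam > 0, the realization
    is unchanged while the parameters move.
    """
    a, b, c = params
    return a * lam, b * lam, c / lam

def param_distance(p, q):
    """Largest absolute difference between corresponding parameters."""
    return max(np.max(np.abs(u - v)) for u, v in zip(p, q))

# Hypothetical weights for a width-3 network with scalar input (illustrative only).
params = (
    np.array([1.0, -2.0, 0.5]),   # inner weights a
    np.array([0.0, 1.0, -0.5]),   # biases b
    np.array([1.0, 0.5, -1.0]),   # outer weights c
)
scaled = rescale(params, 100.0)

x = np.linspace(-2.0, 2.0, 201)
realization_gap = np.max(np.abs(realization(params, x) - realization(scaled, x)))
parameter_gap = param_distance(params, scaled)

print(f"sup-norm gap between realizations on the grid: {realization_gap:.2e}")
print(f"max-norm gap between parametrizations:         {parameter_gap:.2e}")
```

In this sketch the realizations agree up to floating-point error while the parametrizations differ substantially, so closeness of realizations alone does not force closeness of parameters. This is only one simple source of non-uniqueness; it is not meant to reproduce the specific pathologies or the restricted parametrization space discussed in the text.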