Depth separation results propose a possible theoretical explanation for the benefits of deep neural networks over shallower architectures, establishing that the former possess superior approximation capabilities. However, there are no known results in which the deeper architecture leverages this advantage into a provable optimization guarantee. We prove that when the data are generated by a distribution with radial symmetry which satisfies some mild assumptions, gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations, where the hidden layer is held fixed throughout training. By building on and refining existing techniques for approximation lower bounds of neural networks with a single layer of non-linearities, we show that there are $d$-dimensional radial distributions on the data such that ball indicators cannot be learned efficiently by any algorithm to accuracy better than $\Omega(d^{-4})$, nor by a standard gradient descent implementation to accuracy better than a constant. These results establish what is, to the best of our knowledge, the first optimization-based separations where the approximation benefits of the stronger architecture provably manifest in practice. Our proof technique introduces new tools and ideas that may be of independent interest in the theoretical study of both the approximation and optimization of neural networks.
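To make the positive result concrete, the following is a minimal sketch of the kind of setup the abstract describes: radially symmetric inputs, a ball indicator target, and a network with a fixed sigmoidal hidden layer followed by a trained sigmoidal output unit. It is an illustration under stated assumptions, not the construction analyzed in the paper. The isotropic Gaussian data, the coordinate-wise threshold design of the hidden layer, the logistic loss, and all hyperparameters (`thresholds`, `sharpness`, `lr`, `steps`) are hypothetical choices made for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def ball_indicator(x, radius=1.0):
    """Target function: 1.0 if ||x|| <= radius, else 0.0."""
    return (np.linalg.norm(x, axis=1) <= radius).astype(float)


def sample_radial(n, d, rng):
    """Radially symmetric inputs. Assumption: isotropic Gaussian with E||x||^2 = 1."""
    return rng.standard_normal((n, d)) / np.sqrt(d)


def fixed_hidden_layer(x, thresholds, sharpness=4.0):
    """Fixed (never trained) sigmoidal hidden layer.

    Hypothetical design: for each coordinate x_i and threshold t, use the two
    units sigmoid(s * (x_i - t)) and sigmoid(s * (-x_i - t)), so that a linear
    combination can mimic an even, increasing function of |x_i|.
    """
    t = np.asarray(thresholds)[None, None, :]
    pos = sigmoid(sharpness * (x[:, :, None] - t))
    neg = sigmoid(sharpness * (-x[:, :, None] - t))
    return np.concatenate([pos, neg], axis=2).reshape(x.shape[0], -1)


def train_output_layer(features, y, lr=0.05, steps=2000):
    """Gradient descent on the output weights only (logistic loss, an assumption)."""
    n, m = features.shape
    v, c = np.zeros(m), 0.0
    losses = []
    for _ in range(steps):
        p = sigmoid(features @ v + c)  # second sigmoidal layer
        p_safe = np.clip(p, 1e-12, 1.0 - 1e-12)
        losses.append(-np.mean(y * np.log(p_safe) + (1.0 - y) * np.log(1.0 - p_safe)))
        err = (p - y) / n  # gradient of the mean loss w.r.t. the pre-activation
        v -= lr * (features.T @ err)
        c -= lr * err.sum()
    return v, c, losses


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = sample_radial(2000, 5, rng)
    y = ball_indicator(x, radius=1.0)
    feats = fixed_hidden_layer(x, thresholds=np.linspace(0.1, 1.2, 4))
    v, c, losses = train_output_layer(feats, y)
    acc = np.mean((sigmoid(feats @ v + c) > 0.5) == (y > 0.5))
    print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.4f}, train accuracy {acc:.3f}")
```

Because the hidden layer is frozen, the logistic loss is convex in the trained parameters `v` and `c`, which is what makes gradient descent well behaved in this toy version. The sketch says nothing about the lower bounds for networks with a single layer of non-linearities, which concern carefully chosen radial distributions.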