Depth separation -- why a deeper network is more powerful than a shallower one -- has been a major problem in deep learning theory. Previous results often focus on representation power. For example, arXiv:1904.06984 constructed a function that is easy to approximate using a 3-layer network but not approximable by any 2-layer network. In this paper, we show that this separation is in fact algorithmic: one can efficiently learn the function constructed by arXiv:1904.06984 using an overparameterized network with polynomially many neurons. Our result relies on a new way of extending the mean-field limit to multilayer networks, and on a decomposition of the loss that factors out the error introduced by discretizing infinite-width mean-field networks.