We study the theory of neural networks (NNs) through the lens of classical nonparametric regression, with a focus on NNs' ability to adaptively estimate functions with heterogeneous smoothness -- a property of functions in Besov or Bounded Variation (BV) classes. Existing work on this problem requires tuning the NN architecture based on the function space and the sample size. We consider a "Parallel NN" variant of deep ReLU networks and show that standard $\ell_2$ regularization is equivalent to promoting $\ell_p$-sparsity ($0<p<1$) in the coefficient vector of an end-to-end learned set of basis functions, i.e., a dictionary. Using this equivalence, we further establish that, by tuning only the regularization factor, such a parallel NN achieves an estimation error arbitrarily close to the minimax rates for both the Besov and BV classes. Notably, the error rate gets exponentially closer to minimax optimal as the NN gets deeper. Our research sheds new light on why depth matters and how NNs are more powerful than kernel methods.
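As a rough intuition for the stated equivalence, consider the following sketch, which rests on assumptions not given in the abstract: the network output is a sum of $M$ subnetworks, each with $L$ positively homogeneous (ReLU) layers, and $a_{j,\ell} > 0$ is a hypothetical scalar norm of layer $\ell$ in subnetwork $j$. Rescaling the layers within a subnetwork leaves its function unchanged as long as the product $c_j = \prod_{\ell} a_{j,\ell}$ is fixed, so by the AM-GM inequality the smallest achievable $\ell_2$ penalty is

$$\min_{\prod_{\ell} a_{j,\ell} = c_j \ \forall j} \; \sum_{j=1}^{M} \sum_{\ell=1}^{L} a_{j,\ell}^2 \;=\; L \sum_{j=1}^{M} c_j^{2/L},$$

with the minimum attained when $a_{j,1} = \dots = a_{j,L} = c_j^{1/L}$. In this simplified picture the right-hand side is an $\ell_p$-type penalty on the coefficients $c_j$ with $p = 2/L$, so $p < 1$ once $L > 2$, and $p$ shrinks as depth grows. This is only an illustration of the mechanism, not the paper's proof.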