Learning Hierarchical Polynomials with Three-Layer Neural Networks

We study the problem of learning hierarchical polynomials over the standard Gaussian distribution with three-layer neural networks. We specifically consider target functions of the form $h = g \circ p$ where $p : \mathbb{R}^d \rightarrow \mathbb{R}$ is a degree $k$ polynomial and $g: \mathbb{R} \rightarrow \mathbb{R}$ is a degree $q$ polynomial. This function class generalizes the single-index model, which corresponds to $k=1$, and is a natural class of functions possessing an underlying hierarchical structure. Our main result shows that for a large subclass of degree $k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error in $\widetilde{\mathcal{O}}(d^k)$ samples and polynomial time. This is a strict improvement over kernel methods, which require $\widetilde \Theta(d^{kq})$ samples, as well as existing guarantees for two-layer networks, which require the target function to be low-rank. Our result also generalizes prior works on three-layer neural networks, which were restricted to the case of $p$ being a quadratic. When $p$ is indeed a quadratic, we achieve the information-theoretically optimal sample complexity $\widetilde{\mathcal{O}}(d^2)$, which is an improvement over prior work~\citep{nichani2023provable} requiring a sample size of $\widetilde\Theta(d^4)$. Our proof proceeds by showing that during the initial stage of training the network performs feature learning to recover the feature $p$ with $\widetilde{\mathcal{O}}(d^k)$ samples. This work demonstrates the ability of three-layer neural networks to learn complex features and as a result, learn a broad class of hierarchical functions.
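To make the target class concrete, the following is a minimal sketch of one hierarchical polynomial $h = g \circ p$ over standard Gaussian inputs, together with the sample-size scalings quoted above. It is an illustration only, not the paper's construction: the choice of $p$ as a trace-zero quadratic form $x^\top A x$ (the $k=2$ case), the normalization of $A$, and the helper names `make_quadratic_feature`, `hierarchical_target`, and `sample_scalings` are all assumptions introduced here, and the scalings ignore the logarithmic factors and constants hidden by the $\widetilde{\mathcal{O}}$ and $\widetilde\Theta$ notation.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def make_quadratic_feature(d, rng):
    """Hypothetical feature matrix A for p(x) = x^T A x (the k = 2 case).

    A is symmetric with zero trace, so E[p(x)] = 0 for x ~ N(0, I_d), and it is
    scaled so that Var[p(x)] = 2 * ||A||_F^2 = 1.
    """
    A = rng.standard_normal((d, d))
    A = (A + A.T) / 2.0                      # symmetrize
    A -= (np.trace(A) / d) * np.eye(d)       # remove the mean of x^T A x
    A /= np.linalg.norm(A) * np.sqrt(2.0)    # set ||A||_F = 1 / sqrt(2)
    return A


def hierarchical_target(X, A, g_coeffs):
    """Evaluate h(x) = g(p(x)) row-wise, with p(x) = x^T A x.

    g_coeffs lists the coefficients of g in increasing degree order,
    i.e. g(z) = g_coeffs[0] + g_coeffs[1] * z + ...
    """
    p = np.einsum("ni,ij,nj->n", X, A, X)    # p(x_n) for every sample
    return P.polyval(p, g_coeffs)


def sample_scalings(d, k, q):
    """Return (d^k, d^(k*q)): the scalings quoted for the three-layer network
    and for kernel methods, with log factors and constants dropped."""
    return d ** k, d ** (k * q)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k, q = 10, 2, 3
    A = make_quadratic_feature(d, rng)
    X = rng.standard_normal((1000, d))        # x ~ N(0, I_d)
    y = hierarchical_target(X, A, g_coeffs=[0.0, 1.0, 0.0, 1.0])  # g(z) = z + z^3
    print(y.shape)                            # (1000,)
    print(sample_scalings(d, k, q))           # (100, 1000000)
```

Even at $d = 10$ with $k = 2$ and $q = 3$, the two scalings differ by four orders of magnitude, which is the gap the abstract refers to when comparing against kernel methods.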

