Sharp Generalization for Nonparametric Regression in Interpolation Space by Over-Parameterized Neural Networks Trained with Preconditioned Gradient Descent and Early Stopping

Networking · Neural Networks · Functionals · Kernelization · Early Stopping

October 5, 2025


Yingzhen Yang, Ping Li

We study nonparametric regression using an over-parameterized two-layer neural network trained with algorithmic guarantees. We consider the setting where the training features are drawn uniformly from the unit sphere in $\mathbb{R}^d$, and the target function lies in an interpolation space commonly studied in statistical learning theory. We demonstrate that training the neural network with a novel Preconditioned Gradient Descent (PGD) algorithm, equipped with early stopping, achieves a sharp regression rate of $\mathcal{O}(n^{-\frac{2\alpha s'}{2\alpha s'+1}})$ when the target function is in the interpolation space $[\mathcal{H}_K]^{s'}$ with $s' \ge 3$. This rate is even sharper than the currently known nearly-optimal rate of $\mathcal{O}(n^{-\frac{2\alpha s'}{2\alpha s'+1}})\log^2(1/\delta)$~\citep{Li2024-edr-general-domain}, where $n$ is the size of the training data and $\delta \in (0,1)$ is a small probability. It is also sharper than the standard kernel regression rate of $\mathcal{O}(n^{-\frac{2\alpha}{2\alpha+1}})$ obtained under the regular Neural Tangent Kernel (NTK) regime when the neural network is trained with vanilla gradient descent (GD), where $2\alpha = d/(d-1)$. Our analysis is based on two key technical contributions. First, we present a principled decomposition of the network output at each PGD step into a function in the reproducing kernel Hilbert space (RKHS) of a newly induced integral kernel, and a residual function with small $L^{\infty}$-norm. Second, leveraging this decomposition, we apply local Rademacher complexity theory to tightly control the complexity of the function class comprising all the neural network functions obtained across the PGD iterates. Our results further suggest that PGD enables the neural network to escape the linear NTK regime and achieve improved generalization.
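To make the rate comparison in the abstract concrete, the following minimal sketch evaluates the two exponents for a given input dimension $d$ and smoothness index $s'$, using only the formulas stated above with $2\alpha = d/(d-1)$. The function name `rate_exponents` and its arguments are hypothetical, introduced here for illustration. Constants and logarithmic factors are ignored, so the output only indicates how fast each bound decays in $n$.

```python
from fractions import Fraction


def rate_exponents(d, s_prime):
    """Return (ntk_exponent, pgd_exponent) for bounds of the form n^(-exponent).

    Illustrative helper, not taken from the paper. It only evaluates the
    exponents quoted in the abstract, with 2*alpha = d/(d-1).
    The abstract states the PGD rate for s_prime >= 3.
    """
    if d < 2:
        raise ValueError("d must be at least 2 so that d/(d-1) is defined")
    two_alpha = Fraction(d, d - 1)
    # Standard NTK / kernel regression rate: n^(-2a / (2a + 1))
    ntk = two_alpha / (two_alpha + 1)
    # PGD with early stopping: n^(-2a*s' / (2a*s' + 1))
    pgd = (two_alpha * s_prime) / (two_alpha * s_prime + 1)
    return ntk, pgd


if __name__ == "__main__":
    ntk, pgd = rate_exponents(d=3, s_prime=3)
    print(f"NTK exponent: {ntk}, PGD exponent: {pgd}")
```

For $d = 3$ and $s' = 3$ this gives $2\alpha = 3/2$, an NTK exponent of $3/5$, and a PGD exponent of $9/11$, so the PGD bound decays faster in $n$. Exact fractions are used so the exponents can be compared without floating-point rounding.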

