The generalization error of max-margin linear classifiers: Benign overfitting and high dimensional asymptotics in the overparametrized regime

Modern machine learning classifiers often exhibit vanishing classification error on the training set. They achieve this by learning nonlinear representations of the inputs that map the data into linearly separable classes. Motivated by these phenomena, we revisit high-dimensional maximum margin classification for linearly separable data. We consider a stylized setting in which data $(y_i,{\boldsymbol x}_i)$, $i\le n$, are i.i.d. with ${\boldsymbol x}_i\sim\mathsf{N}({\boldsymbol 0},{\boldsymbol \Sigma})$ a $p$-dimensional Gaussian feature vector, and $y_i \in\{+1,-1\}$ a label whose distribution depends on a linear combination of the covariates $\langle {\boldsymbol \theta}_*,{\boldsymbol x}_i \rangle$. While the Gaussian model might appear extremely simplistic, universality arguments can be used to show that the results derived in this setting also apply to the output of certain nonlinear featurization maps. We consider the proportional asymptotics $n,p\to\infty$ with $p/n\to \psi$, and derive exact expressions for the limiting generalization error. We use this theory to derive two results of independent interest: $(i)$ Sufficient conditions on $({\boldsymbol \Sigma},{\boldsymbol \theta}_*)$ for "benign overfitting" that parallel previously derived conditions in the case of linear regression; $(ii)$ An asymptotically exact expression for the generalization error when max-margin classification is used in conjunction with feature vectors produced by random one-layer neural networks.
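
For concreteness, the objects named in the abstract can be written out as follows. This is a sketch in standard notation rather than a quotation from the paper: the link function $f$, the estimator symbol $\hat{\boldsymbol \theta}$, the error symbol $\mathrm{Err}_n$, and the fresh test pair $(y_{\rm new},{\boldsymbol x}_{\rm new})$ are assumptions introduced here for illustration.

$$
\mathbb{P}(y_i=+1\mid {\boldsymbol x}_i)=f(\langle {\boldsymbol \theta}_*,{\boldsymbol x}_i\rangle),
\qquad
\hat{\boldsymbol \theta}\in\arg\max_{\|{\boldsymbol \theta}\|_2=1}\ \min_{i\le n}\ y_i\,\langle {\boldsymbol \theta},{\boldsymbol x}_i\rangle,
\qquad
\mathrm{Err}_n=\mathbb{P}\big(y_{\rm new}\,\langle \hat{\boldsymbol \theta},{\boldsymbol x}_{\rm new}\rangle\le 0\big).
$$

Under this reading, the training data are linearly separable exactly when the maximum above is strictly positive, in which case $\hat{\boldsymbol \theta}$ classifies every training point correctly. The probability in $\mathrm{Err}_n$ is taken over a fresh test pair drawn from the same distribution, and the limiting generalization error of the abstract would be the limit of $\mathrm{Err}_n$ as $n,p\to\infty$ with $p/n\to\psi$.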


Related content

Generalization error


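The generalization ability of a learning method is the ability of the model it learns to predict unseen data, and it is an essential property of the method. In practice, generalization ability is most often assessed through the error measured on a test set. Generalization error bounds characterize the gap between a learning algorithm's empirical risk and its expected risk, and the rate at which that gap closes. The generalization error of a machine learner is a function describing how far the student machine remains from the teacher machine after learning from sample data.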
