Universality of max-margin classifiers

Maximum margin binary classification is one of the most fundamental algorithms in machine learning, yet the role of featurization maps and the high-dimensional asymptotics of the misclassification error for non-Gaussian features are still poorly understood. We consider settings in which we observe binary labels $y_i$ and either $d$-dimensional covariates ${\boldsymbol z}_i$ that are mapped to a $p$-dimensional space via a randomized featurization map ${\boldsymbol \phi}:\mathbb{R}^d \to\mathbb{R}^p$, or $p$-dimensional features with independent non-Gaussian entries. In this context, we study two fundamental questions: $(i)$ At what overparametrization ratio $p/n$ do the data become linearly separable? $(ii)$ What is the generalization error of the max-margin classifier? Working in the high-dimensional regime in which the number of features $p$, the number of samples $n$ and the input dimension $d$ (in the nonlinear featurization setting) diverge, with ratios of order one, we prove a universality result establishing that the asymptotic behavior is completely determined by the expected covariance of the feature vectors and by the covariance between features and labels. In particular, the overparametrization threshold and the generalization error can be computed within a simpler Gaussian model. The main technical challenge lies in the fact that the max-margin classifier is not the maximizer (or minimizer) of an empirical average, but the maximizer of a minimum over the samples. We address this by representing the classifier as an average over support vectors. Crucially, we find that in high dimensions, the number of support vectors is proportional to the number of samples, which ultimately yields universality.
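
For concreteness, the max-min structure of the max-margin classifier can be sketched in the standard hard-margin form. The notation here is assumed for illustration and is not fixed by the abstract: ${\boldsymbol x}_i \in \mathbb{R}^p$ denotes the feature vector of sample $i$ (for instance ${\boldsymbol x}_i = {\boldsymbol \phi}({\boldsymbol z}_i)$), the labels are taken in $\{+1,-1\}$, ${\boldsymbol \theta}$ is the weight vector, and $\lambda_i$ are nonnegative multipliers.

$$
\hat{\boldsymbol \theta} \in \arg\max_{\|{\boldsymbol \theta}\|_2 \le 1} \; \min_{i \le n} \; y_i \langle {\boldsymbol \theta}, {\boldsymbol x}_i \rangle .
$$

Under these assumptions, the data are linearly separable exactly when the optimal value is strictly positive. In that case the optimality conditions give

$$
\hat{\boldsymbol \theta} \propto \sum_{i=1}^{n} \lambda_i \, y_i \, {\boldsymbol x}_i , \qquad \lambda_i \ge 0 ,
$$

with $\lambda_i > 0$ only for the support vectors, that is, the samples attaining the minimum margin. This is the sense in which the classifier can be written as a weighted average over support vectors.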
