We study the generalization properties of unregularized gradient methods applied to separable linear classification, a setting that has received considerable attention since the pioneering work of Soudry et al. (2018). We establish tight upper and lower (population) risk bounds for gradient descent in this setting, for any smooth loss function, expressed in terms of its tail decay rate. Our bounds take the form $\Theta(r_{\ell,T}^2 / (\gamma^2 T) + r_{\ell,T}^2 / (\gamma^2 n))$, where $T$ is the number of gradient steps, $n$ is the size of the training set, $\gamma$ is the data margin, and $r_{\ell,T}$ is a complexity term that depends on the tail decay rate of the loss function and on $T$. Our upper bound matches the best known upper bounds due to Shamir (2021) and Schliserman and Koren (2022), while extending their applicability to virtually any smooth loss function and relaxing the technical assumptions they impose. Our risk lower bounds are the first in this context and establish the tightness of our upper bounds for any given tail decay rate and in all parameter regimes. The proof technique is also markedly simpler than in previous work and extends readily to other gradient methods; we illustrate this by providing analogous results for Stochastic Gradient Descent.
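For concreteness, the setting can be sketched in a few lines of Python: unregularized, full-batch gradient descent on the empirical risk of a linear classifier over data separable with margin $\gamma$. This is only an illustrative sketch under stated assumptions and not the construction analyzed in the paper: the logistic loss is used as one example of a smooth loss, and the data generator, function names, and parameter values are all hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def logistic_loss(z):
    """Numerically stable log(1 + exp(-z)); one example of a smooth loss."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def logistic_loss_grad(z):
    """Derivative of log(1 + exp(-z)) with respect to z (stable form)."""
    if z >= 0:
        e = math.exp(-z)
        return -e / (1.0 + e)
    return -1.0 / (1.0 + math.exp(z))


def make_separable_data(n, dim, gamma, rng):
    """Hypothetical generator: unit-norm points labeled by the sign of the
    first coordinate, kept only if their margin is at least gamma."""
    data = []
    while len(data) < n:
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(dot(x, x))
        x = [v / norm for v in x]
        y = 1.0 if x[0] >= 0 else -1.0
        if y * x[0] >= gamma:
            data.append((x, y))
    return data


def empirical_loss(w, data):
    return sum(logistic_loss(y * dot(w, x)) for x, y in data) / len(data)


def zero_one_error(w, data):
    """Fraction of points that are not classified with positive margin."""
    return sum(1 for x, y in data if y * dot(w, x) <= 0) / len(data)


def gradient_descent(data, steps, lr):
    """Unregularized full-batch gradient descent started from w = 0."""
    dim = len(data[0][0])
    n = len(data)
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for x, y in data:
            g = logistic_loss_grad(y * dot(w, x))
            for j in range(dim):
                grad[j] += g * y * x[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


if __name__ == "__main__":
    rng = random.Random(0)
    train = make_separable_data(n=50, dim=5, gamma=0.2, rng=rng)
    fresh = make_separable_data(n=500, dim=5, gamma=0.2, rng=rng)
    for steps in (10, 100, 1000):
        w = gradient_descent(train, steps=steps, lr=1.0)
        print(steps, empirical_loss(w, train), zero_one_error(w, fresh))
```

Running the script prints, for several values of $T$, the empirical loss on the training set and the zero-one error on a fresh sample, which serves as a rough stand-in for the population risk. Varying `steps`, `n`, and `gamma` gives an informal feel for the roles of $T$, $n$, and $\gamma$ in the bound; it does not verify the bound itself.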