Regularized linear regression is a promising approach for binary classification problems in which the training set has noisy labels, since the regularization term can help to avoid interpolating the mislabeled data points. In this paper, we provide a systematic study of how the regularization strength affects the performance of linear classifiers that are trained to solve binary classification problems by minimizing a regularized least-squares objective. We consider the over-parametrized regime and assume that the classes are generated from a Gaussian Mixture Model (GMM) in which a fraction $c<\frac{1}{2}$ of the training data is mislabeled. Under these assumptions, we rigorously analyze the classification errors that result from applying ridge, $\ell_1$, and $\ell_\infty$ regression. In particular, we demonstrate that ridge regression invariably improves the classification error. We prove that $\ell_1$ regularization induces sparsity, and we observe that in many cases one can sparsify the solution by up to two orders of magnitude without any considerable loss of performance, even though the GMM has no underlying sparsity structure. For $\ell_\infty$ regularization, we show that, for large enough regularization strength, the optimal weights concentrate around two values of opposite sign. We observe that in many cases the corresponding "compression" of each weight to a single bit leads to very little loss in performance. These latter observations can have significant practical ramifications.
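To make the setting concrete, the following is a minimal sketch of the ridge case only, written under assumptions that the abstract does not spell out: a two-component GMM with isotropic unit-variance components centered at $\pm\mu$, labels in $\{-1,+1\}$, exactly $\lfloor cn \rfloor$ training labels flipped, and the objective $\|Xw-y\|_2^2+\lambda\|w\|_2^2$. The function names (`sample_gmm`, `ridge_fit`, `classification_error`) and all parameter values are hypothetical and chosen for illustration; this is not the paper's experimental code.

```python
import numpy as np


def sample_gmm(n, d, mu, flip_frac, rng):
    """Draw n points from an assumed two-component GMM with means +mu / -mu.

    Returns the features, the clean labels, and a copy of the labels in
    which floor(flip_frac * n) randomly chosen entries are flipped.
    """
    y = rng.choice([-1.0, 1.0], size=n)
    X = y[:, None] * mu[None, :] + rng.standard_normal((n, d))
    y_noisy = y.copy()
    n_flip = int(flip_frac * n)
    flip_idx = rng.choice(n, size=n_flip, replace=False)
    y_noisy[flip_idx] *= -1.0
    return X, y, y_noisy


def ridge_fit(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 for lam > 0.

    Uses the dual form w = X^T (X X^T + lam I)^{-1} y, which only needs an
    n x n solve and is therefore convenient when d > n (over-parametrized).
    """
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)


def classification_error(w, X, y):
    """Fraction of points whose predicted sign disagrees with the label."""
    return float(np.mean(np.sign(X @ w) != y))


rng = np.random.default_rng(0)
n, d, c = 50, 200, 0.2            # d > n: over-parametrized; c < 1/2
mu = np.full(d, 2.0 / np.sqrt(d))  # assumed mean vector with ||mu|| = 2

X_train, y_clean, y_noisy = sample_gmm(n, d, mu, c, rng)
X_test, y_test, _ = sample_gmm(2000, d, mu, 0.0, rng)  # clean test labels

# Sweep the regularization strength and report the test error for each value.
for lam in [1e-3, 1e-1, 1e1, 1e3]:
    w = ridge_fit(X_train, y_noisy, lam)
    err = classification_error(w, X_test, y_test)
    print(f"lambda={lam:g}  test error={err:.3f}")
```

The sweep prints the empirical test error for a few values of $\lambda$; the numbers depend on the seed and on the assumed parameters, so they illustrate the setup rather than reproduce any result from the paper. The $\ell_1$ and $\ell_\infty$ cases have no closed form and would need a convex solver, which is omitted here.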