We study the use of linear regression for multiclass classification in the over-parametrized regime where some of the training data is mislabeled. In such scenarios it is necessary to add an explicit regularization term, $\lambda f(w)$, for some convex function $f(\cdot)$, to avoid overfitting the mislabeled data. In our analysis, we assume that the data is sampled from a Gaussian Mixture Model with equal class sizes, and that a proportion $c$ of the training labels is corrupted for each class. Under these assumptions, we prove that the best classification performance is achieved when $f(\cdot) = \|\cdot\|^2_2$ and $\lambda \to \infty$. We then proceed to analyze the classification errors for $f(\cdot) = \|\cdot\|_1$ and $f(\cdot) = \|\cdot\|_\infty$ in the large $\lambda$ regime and notice that it is often possible to find sparse and one-bit solutions, respectively, that perform almost as well as the one corresponding to $f(\cdot) = \|\cdot\|_2^2$.
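The setting described above can be sketched numerically. The snippet below is an illustrative assumption-laden toy, not the paper's construction: it draws equal-sized classes from a Gaussian mixture, corrupts a fraction $c$ of the labels in each class, and fits one-hot least squares with a large ridge penalty $\lambda \|w\|_2^2$, approximating the $\lambda \to \infty$ regime. All dimensions, means, and noise scales are invented for the demo.

```python
# Illustrative sketch (assumed setup, not the paper's exact model):
# multiclass linear regression with ridge regularization on Gaussian
# mixture data whose training labels are corrupted at rate c per class.
import numpy as np

rng = np.random.default_rng(0)

d, k, n_per = 200, 3, 50           # over-parametrized: d > k * n_per
c = 0.2                            # label-corruption rate per class

means = rng.normal(size=(k, d))    # one Gaussian mean per class
X = np.vstack([means[j] + rng.normal(size=(n_per, d)) for j in range(k)])
y = np.repeat(np.arange(k), n_per)

# corrupt a fraction c of the labels in each class, flipping to a
# uniformly random other class
y_noisy = y.copy()
for j in range(k):
    idx = np.where(y == j)[0]
    bad = rng.choice(idx, size=int(c * n_per), replace=False)
    y_noisy[bad] = rng.choice([l for l in range(k) if l != j], size=bad.size)

Y = np.eye(k)[y_noisy]             # one-hot regression targets

lam = 1e6                          # large lambda, approximating lambda -> inf
# ridge solution: W = (X^T X + lam I)^{-1} X^T Y
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# classify fresh (clean-label) samples by the largest linear score
X_test = np.vstack([means[j] + rng.normal(size=(100, d)) for j in range(k)])
y_test = np.repeat(np.arange(k), 100)
acc = np.mean((X_test @ W).argmax(axis=1) == y_test)
print(f"test accuracy: {acc:.2f}")
```

In the large-$\lambda$ limit the ridge solution is proportional to $X^\top Y$, i.e. a correlation (class-mean) classifier, which is why the mislabeled points, being averaged against the bulk of clean ones, do little damage here.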