We study linear regression for multiclass classification in the over-parametrized regime when some of the training data is mislabeled. In such scenarios, an explicit regularization term $\lambda f(w)$, for some convex function $f(\cdot)$, is needed to avoid overfitting the mislabeled data. In our analysis, we assume that the data is sampled from a Gaussian Mixture Model with equal class sizes and that a proportion $c$ of the training labels in each class is corrupted. Under these assumptions, we prove that the best classification performance is achieved when $f(\cdot) = \|\cdot\|^2_2$ and $\lambda \to \infty$. We then analyze the classification errors for $f(\cdot) = \|\cdot\|_1$ and $f(\cdot) = \|\cdot\|_\infty$ in the large-$\lambda$ regime and observe that it is often possible to find sparse and one-bit solutions, respectively, that perform almost as well as the solution corresponding to $f(\cdot) = \|\cdot\|_2^2$.
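To make the setup concrete, the following is a minimal sketch of the $f(\cdot) = \|\cdot\|_2^2$ case, not the authors' code. It assumes one-hot label encoding, identity-covariance mixture components, and a noise model in which each corrupted label is replaced by a uniformly chosen different class. The function names (`make_gmm`, `ridge_classifier`, `predict`) and all numerical settings (dimension, sample sizes, mean scale, $\lambda$ grid) are hypothetical and chosen only for illustration.

```python
import numpy as np


def make_gmm(n_per_class, means, c, rng):
    """Sample an equal-class-size Gaussian mixture and corrupt a fraction c
    of the labels in each class (assumed noise model: flip to another class)."""
    k, d = means.shape
    y_true = np.repeat(np.arange(k), n_per_class)
    # Identity-covariance components centered at the class means (assumption).
    X = means[y_true] + rng.standard_normal((k * n_per_class, d))
    y_noisy = y_true.copy()
    n_flip = int(round(c * n_per_class))
    for j in range(k):
        idx = rng.choice(np.flatnonzero(y_true == j), size=n_flip, replace=False)
        # Adding an offset in {1, ..., k-1} modulo k always yields a different class.
        y_noisy[idx] = (j + rng.integers(1, k, size=n_flip)) % k
    return X, y_true, y_noisy


def ridge_classifier(X, y, k, lam):
    """Minimize ||X W - Y||_F^2 + lam * ||W||_F^2 with one-hot Y; returns W (d x k)."""
    n = X.shape[0]
    Y = np.eye(k)[y]
    # Dual form W = X^T (X X^T + lam I)^{-1} Y, which is cheaper when d > n.
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), Y)


def predict(X, W):
    """Assign each sample to the class with the largest regression output."""
    return np.argmax(X @ W, axis=1)


# Illustrative run in an over-parametrized setting (d > n); all values are assumptions.
rng = np.random.default_rng(0)
k, d, n_per_class, c = 3, 400, 50, 0.2
means = 0.5 * rng.standard_normal((k, d))
X_train, _, y_train_noisy = make_gmm(n_per_class, means, c, rng)
X_test, y_test, _ = make_gmm(200, means, 0.0, rng)

for lam in (1e-3, 1.0, 1e4):
    W = ridge_classifier(X_train, y_train_noisy, k, lam)
    acc = np.mean(predict(X_test, W) == y_test)
    print(f"lambda={lam:g}  test accuracy={acc:.3f}")
```

Because the `argmax` decision is invariant to positive rescaling of $W$, the $\lambda \to \infty$ limit remains a well-defined classifier: $\lambda W$ tends to $X^\top Y$, so the predictions depend only on per-class sums of the training points under their (possibly corrupted) labels.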