We study robust linear regression in high dimensions, where both the dimension $d$ and the number of data points $n$ diverge at a fixed ratio $\alpha=n/d$, under a data model that includes outliers. We provide exact asymptotics for the performance of empirical risk minimisation (ERM) with $\ell_2$-regularised $\ell_2$, $\ell_1$, and Huber losses, which are the standard approaches to such problems. We focus on two performance metrics: the generalisation error on similar datasets with outliers, and the estimation error of the original, unpolluted function. We compare our results with the information-theoretic Bayes-optimal estimation bound. For the generalisation error, we find that optimally regularised ERM is asymptotically consistent in the large sample complexity limit if one performs a simple calibration, and we compute the rates of convergence. For the estimation error, however, we show that, due to a norm calibration mismatch, consistency of the estimator requires an oracle estimate of the optimal norm or a cross-validation set not corrupted by outliers. We examine in detail how performance depends on the loss function and on the degree of outlier corruption in the training set, and we identify a region of parameters where the optimal performance of the Huber loss is identical to that of the $\ell_2$ loss, offering insight into the use cases of the different loss functions.
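As a minimal illustration of the setting, the following Python sketch fits $\ell_2$-regularised ERM with the $\ell_2$ loss and with the Huber loss on synthetic data containing outliers, then compares their estimation errors. Everything specific here is an assumption made for illustration and is not taken from the analysis: the outlier model (a fraction `eps` of labels replaced by pure noise), the function names `make_data`, `ridge_l2`, and `ridge_huber`, the fixed values of the regularisation `lam` and Huber threshold `delta`, and the use of iteratively reweighted least squares (IRLS) as the solver. Neither hyperparameter is tuned, so the output does not correspond to the optimally regularised setting.

```python
import numpy as np


def make_data(n, d, eps, sigma=0.1, sigma_out=10.0, seed=0):
    """Synthetic linear data; a fraction eps of the labels is replaced by noise.

    The outlier model is an illustrative assumption, not the paper's exact one.
    """
    rng = np.random.default_rng(seed)
    theta_star = rng.normal(size=d) / np.sqrt(d)  # squared norm close to 1
    X = rng.normal(size=(n, d))
    y = X @ theta_star + sigma * rng.normal(size=n)
    is_outlier = rng.random(n) < eps
    # Outlier labels carry no information about theta_star.
    y[is_outlier] = sigma_out * rng.normal(size=is_outlier.sum())
    return X, y, theta_star


def ridge_l2(X, y, lam):
    """Minimise 0.5 * sum (y - X theta)^2 + 0.5 * lam * ||theta||^2 (closed form)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def ridge_huber(X, y, lam, delta=1.0, n_iter=100, tol=1e-10):
    """Minimise sum huber_delta(y - X theta) + 0.5 * lam * ||theta||^2 by IRLS."""
    d = X.shape[1]
    theta = np.zeros(d)
    for _ in range(n_iter):
        r = np.abs(y - X @ theta)
        # Huber weights: 1 inside the quadratic region, delta / |r| outside it.
        w = np.where(r <= delta, 1.0, delta / np.maximum(r, delta))
        XtW = X.T * w  # scales the i-th sample by its weight w_i
        theta_new = np.linalg.solve(XtW @ X + lam * np.eye(d), XtW @ y)
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta


if __name__ == "__main__":
    # alpha = n / d = 10, with 10% of the labels corrupted.
    X, y, theta_star = make_data(n=500, d=50, eps=0.1)
    estimates = {"l2": ridge_l2(X, y, 0.1), "huber": ridge_huber(X, y, 0.1)}
    for name, theta in estimates.items():
        print(name, "estimation error:", np.sum((theta - theta_star) ** 2))
```

When `delta` exceeds every residual, all IRLS weights equal 1 and `ridge_huber` returns the same estimate as `ridge_l2`. This is a property of the sketch only; it is not the parameter region identified in the analysis, which concerns optimally tuned performance.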