We investigate the high-dimensional properties of robust regression estimators in the presence of heavy-tailed contamination of both the covariates and response functions. In particular, we provide a sharp asymptotic characterisation of M-estimators trained on a family of elliptical covariate and noise data distributions, including cases where second and higher moments do not exist. We show that, despite being consistent, the Huber loss with optimally tuned location parameter $\delta$ is suboptimal in the high-dimensional regime in the presence of heavy-tailed noise, highlighting the necessity of further regularisation to achieve optimal performance. This result also uncovers the existence of a transition in $\delta$ as a function of the sample complexity and contamination. Moreover, we derive the decay rates for the excess risk of ridge regression. We show that, while ridge regression is both optimal and universal for covariate distributions with finite second moment, the decay rate of its excess risk can be considerably faster when the covariates' second moment does not exist. Finally, we show that our formulas readily generalise to a richer family of models and data distributions, such as generalised linear estimation with arbitrary convex regularisation trained on mixture models.