This paper introduces a new regularized version of the robust $\tau$-regression estimator for analyzing high-dimensional datasets subject to gross contamination in both the response variables and the covariates (explanatory variables). The resulting estimator, termed adaptive $\tau$-Lasso, is robust to outliers and high-leverage points. It also incorporates an adaptive $\ell_1$-norm penalty term that assigns a separate weight to each regression coefficient, which enables the selection of relevant variables and reduces the bias associated with large true regression coefficients. For a fixed number of predictors $p$, we show that the adaptive $\tau$-Lasso has the oracle property, ensuring both variable-selection consistency and asymptotic normality. Asymptotic normality applies only to the entries of the regression vector corresponding to the true support, assuming knowledge of the true regression vector support. We characterize the estimator's robustness via the finite-sample breakdown point and the influence function. We carry out extensive simulations and observe that the class of $\tau$-Lasso estimators exhibits robustness and reliable performance in both contaminated and uncontaminated data settings. We also validate our theoretical findings on robustness properties through simulation experiments. In the presence of outliers and high-leverage points, the adaptive $\tau$-Lasso and $\tau$-Lasso estimators achieve the best or close-to-best performance in terms of prediction and variable-selection accuracy among the competing regularized estimators, across all scenarios considered in this study. Therefore, the adaptive $\tau$-Lasso and $\tau$-Lasso estimators can be effectively employed for a variety of sparse linear regression problems, particularly in high-dimensional settings and when the data are contaminated by outliers and high-leverage points.
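The per-coefficient weighting in the adaptive $\ell_1$-norm penalty can be illustrated with a minimal sketch. The abstract does not state how the weights are formed, so the code below assumes the standard adaptive-Lasso convention $w_j = 1/|\tilde{\beta}_j|^{\gamma}$ computed from a pilot estimate $\tilde{\beta}$. The function names, the exponent `gamma`, the stabilizer `eps`, and the pilot values are all hypothetical and chosen for illustration only. The sketch covers only the penalty term, not the robust $\tau$-scale loss.

```python
import numpy as np

def adaptive_weights(pilot_beta, gamma=1.0, eps=1e-8):
    """Weights w_j = 1 / (|pilot_beta_j| + eps)**gamma (assumed convention).

    eps is a hypothetical stabilizer that keeps the weight finite when a
    pilot coefficient is exactly zero.
    """
    pilot_beta = np.asarray(pilot_beta, dtype=float)
    return 1.0 / (np.abs(pilot_beta) + eps) ** gamma

def adaptive_l1_penalty(beta, weights, lam):
    """Weighted l1 penalty: lam * sum_j weights_j * |beta_j|."""
    beta = np.asarray(beta, dtype=float)
    return lam * np.sum(weights * np.abs(beta))

# Hypothetical pilot estimate: one large, one small, one zero coefficient.
pilot = np.array([3.0, 0.5, 0.0])
weights = adaptive_weights(pilot)

# Large pilot coefficients receive small weights (little shrinkage, hence
# less bias), while near-zero ones receive very large weights and are
# pushed towards exactly zero.
penalty = adaptive_l1_penalty(pilot, weights, lam=1.0)
```

Under this assumed convention, the weight shrinks as the pilot coefficient grows. This matches the abstract's claim that the adaptive penalty reduces the bias associated with large true regression coefficients while still selecting variables.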