This paper introduces a new regularized version of the robust $\tau$-regression estimator for analyzing high-dimensional datasets subject to gross contamination in the response variables and covariates (explanatory variables). The resulting estimator, termed adaptive $\tau$-Lasso, is robust to outliers and high-leverage points. It also incorporates an adaptive $\ell_1$-norm penalty term, which enables the selection of relevant variables and reduces the bias associated with large true regression coefficients. More specifically, this adaptive $\ell_1$-norm penalty assigns an individual weight to each regression coefficient. For a fixed number of predictors $p$, we show that the adaptive $\tau$-Lasso enjoys the oracle property, ensuring both variable-selection consistency and asymptotic normality, where asymptotic normality holds for the entries of the regression vector corresponding to the true support, assuming the support of the true regression vector is known. We characterize its robustness by deriving the finite-sample breakdown point and the influence function. Extensive simulations show that the class of $\tau$-Lasso estimators is robust and performs reliably in both contaminated and uncontaminated data settings, and they corroborate our theoretical findings on the robustness properties. In the presence of outliers and high-leverage points, the adaptive $\tau$-Lasso and $\tau$-Lasso estimators achieve the best or close-to-best performance in prediction and variable-selection accuracy among the competing regularized estimators across all scenarios considered in this study. The adaptive $\tau$-Lasso and $\tau$-Lasso estimators therefore provide attractive tools for a broad range of sparse linear regression problems, particularly in high-dimensional settings and when the data are contaminated by outliers and high-leverage points.
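The adaptive $\ell_1$-norm penalty described above can be sketched schematically as follows. This is an illustrative formulation only: the weight rule $\hat{w}_j = |\tilde{\beta}_j|^{-\gamma}$ follows the standard adaptive-Lasso convention and the symbols $\lambda_n$, $\gamma$, and $\tilde{\beta}$ are assumptions, not necessarily the paper's exact notation:

```latex
\hat{\beta} \;=\; \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p}
\; n\,\tau_n^2\!\bigl(r_1(\beta),\dots,r_n(\beta)\bigr)
\;+\; \lambda_n \sum_{j=1}^{p} \hat{w}_j \,\lvert \beta_j \rvert,
\qquad
\hat{w}_j \;=\; \frac{1}{\lvert \tilde{\beta}_j \rvert^{\gamma}},
```

where $r_i(\beta) = y_i - \mathbf{x}_i^{\top}\beta$ are the residuals, $\tau_n(\cdot)$ is the robust $\tau$-scale of the residuals, $\tilde{\beta}$ is an initial robust estimate (e.g., a plain $\tau$-Lasso fit), and $\gamma > 0$. Large initial coefficients receive small weights and are thus penalized less, which is the mechanism behind the reduced bias for large true coefficients.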