Challenges with data in the big-data era include: (i) the dimension $p$ is often larger than the sample size $n$, and (ii) outliers or contaminated points are frequently hidden and more difficult to detect. Challenge (i) renders most conventional methods inapplicable and has therefore attracted tremendous attention from the statistics, computer science, and biomedical communities. Numerous penalized regression methods have been introduced as modern tools for analyzing high-dimensional data. Challenge (ii), however, has received disproportionately little attention. Penalized regression methods perform their intended task very well and are expected to handle challenge (ii) simultaneously. Most of them, however, can break down under a single outlier (or a single adversarially contaminated point), as this article reveals. The article systematically examines the leading penalized regression methods in the literature in terms of their robustness and provides a quantitative assessment. Consequently, a novel robust penalized regression method based on the least sum of squares of depth-trimmed residuals is proposed and studied carefully. Experiments with simulated and real data show that the newly proposed method can outperform some leading competitors in estimation and prediction accuracy in the cases considered.