High-dimensional linear regression under heavy-tailed noise or outlier corruption is challenging, both computationally and statistically. Convex approaches have been proven statistically optimal but suffer from high computational costs, especially since robust loss functions are usually non-smooth. More recently, computationally fast non-convex approaches based on sub-gradient descent have been proposed; unfortunately, they fail to deliver a statistically consistent estimator even under sub-Gaussian noise. In this paper, we introduce a projected sub-gradient descent algorithm for both sparse and low-rank linear regression. The algorithm is not only computationally efficient, converging linearly, but also statistically optimal, whether the noise is Gaussian or heavy-tailed with a finite 1 + ε moment. The convergence theory is established for a general framework, and its specific applications to the absolute loss, Huber loss, and quantile loss are investigated. In contrast to existing non-convex methods, our analysis reveals a surprising phenomenon of two-phase convergence. In phase one, the algorithm behaves as in typical non-smooth optimization and requires gradually decaying stepsizes. However, phase one delivers only a statistically sub-optimal estimator, as has already been observed in the existing literature. Interestingly, during phase two, the algorithm converges linearly, as if minimizing a smooth and strongly convex objective function, and thus a constant stepsize suffices. Underlying the phase-two convergence is the smoothing effect of random noise on the non-smooth robust losses in a region that is close, but not too close, to the truth. Numerical simulations confirm our theoretical findings and showcase the superiority of our algorithm over prior methods.
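The two-phase scheme described above can be illustrated with a minimal sketch for the sparse case with absolute loss. This is not the paper's implementation: the function names, the geometric decay schedule, the hard-thresholding projection, and all numeric constants below are illustrative assumptions chosen only to show the shape of the iteration (decaying stepsizes first, then a constant stepsize).

```python
import numpy as np


def hard_threshold(beta, s):
    """Project onto s-sparse vectors by keeping the s largest-magnitude entries."""
    out = np.zeros_like(beta)
    if s > 0:
        keep = np.argpartition(np.abs(beta), -s)[-s:]
        out[keep] = beta[keep]
    return out


def projected_subgradient(X, y, s, eta0, decay, n_phase1, eta_const, n_phase2):
    """Projected sub-gradient descent for the absolute loss (1/n) * sum |y_i - x_i' beta|.

    Phase one uses geometrically decaying stepsizes eta0 * decay**t (assumed schedule);
    phase two switches to the constant stepsize eta_const.
    """
    n, d = X.shape
    beta = np.zeros(d)
    for t in range(n_phase1 + n_phase2):
        eta = eta0 * decay**t if t < n_phase1 else eta_const
        residual = y - X @ beta
        # np.sign(0) = 0 lies in [-1, 1], so this is a valid sub-gradient.
        subgrad = -X.T @ np.sign(residual) / n
        beta = hard_threshold(beta - eta * subgrad, s)
    return beta


# Hypothetical simulation: Gaussian design, Student-t noise with 2 degrees of
# freedom (infinite variance, but a finite 1 + epsilon moment for epsilon < 1).
rng = np.random.default_rng(0)
n, d, s = 500, 50, 3
beta_star = np.zeros(d)
beta_star[:s] = [3.0, -2.0, 1.5]
X = rng.standard_normal((n, d))
y = X @ beta_star + rng.standard_t(df=2, size=n)

beta_hat = projected_subgradient(
    X, y, s, eta0=1.0, decay=0.9, n_phase1=40, eta_const=0.5, n_phase2=50
)
print("estimation error:", np.linalg.norm(beta_hat - beta_star))
```

In this sketch the sparsity level `s` is assumed known, and the switch between phases happens after a fixed number of iterations; in practice both would need to be chosen from the theory or by tuning. A low-rank variant would presumably replace `hard_threshold` with a truncated SVD projection.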