Low-rank matrix estimation under heavy-tailed noise is challenging, both computationally and statistically. Convex approaches have been proven statistically optimal but suffer from high computational costs, especially since robust loss functions are usually non-smooth. More recently, computationally fast non-convex approaches via sub-gradient descent have been proposed, which, unfortunately, fail to deliver a statistically consistent estimator even under sub-Gaussian noise. In this paper, we introduce a novel Riemannian sub-gradient (RsGrad) algorithm which is not only computationally efficient with linear convergence but also statistically optimal, whether the noise is Gaussian or heavy-tailed. Convergence theory is established for a general framework, and specific applications to absolute loss, Huber loss, and quantile loss are investigated. Compared with existing non-convex methods, ours reveals a surprising phenomenon of dual-phase convergence. In phase one, RsGrad behaves as in typical non-smooth optimization, which requires gradually decaying stepsizes. However, phase one only delivers a statistically sub-optimal estimator, as has already been observed in the existing literature. Interestingly, during phase two, RsGrad converges linearly as if minimizing a smooth and strongly convex objective function, and thus a constant stepsize suffices. Underlying the phase-two convergence is the smoothing effect of random noise on the non-smooth robust losses in an area close, but not too close, to the truth. Lastly, RsGrad is applicable to low-rank tensor estimation under heavy-tailed noise, where a statistically optimal rate is attainable with the same phenomenon of dual-phase convergence, and a novel shrinkage-based second-order moment method is guaranteed to deliver a warm initialization. Numerical simulations confirm our theoretical discovery and showcase the superiority of RsGrad over prior methods.
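To make the dual-phase stepsize idea concrete, the following is a minimal sketch of a Riemannian sub-gradient iteration with the absolute loss. It is an illustration under stated assumptions, not the paper's implementation. The abstract does not fix an observation model, so the sketch assumes trace regression with Gaussian sensing matrices, $y_i = \langle X_i, M^\ast \rangle + \xi_i$, with Student-t noise (2 degrees of freedom) standing in for heavy tails. The function names (`truncate_rank`, `tangent_project`, `rsgrad_abs_loss`), the stepsize values, the switch point between phases, and the warm start obtained by perturbing the truth are all hypothetical choices, hand-picked for this toy problem.

```python
import numpy as np

def truncate_rank(M, r):
    """Best rank-r approximation via SVD; also returns the singular factors."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    U, s, Vt = U[:, :r], s[:r], Vt[:r]
    return (U * s) @ Vt, U, Vt.T

def tangent_project(G, U, V):
    """Project G onto the tangent space of the rank-r manifold at U S V^T."""
    UtG = U.T @ G
    GV = G @ V
    return U @ UtG + GV @ V.T - U @ (UtG @ V) @ V.T

def rsgrad_abs_loss(X, y, M0, r, eta0=3.0, decay=0.9, eta_const=1.0,
                    phase_one_iters=10, total_iters=50):
    """Sketch of Riemannian sub-gradient descent for sum_i |y_i - <X_i, M>|.

    The schedule (geometric decay, then a constant stepsize) and all default
    values are assumptions made for this example only.
    """
    n = len(y)
    M, U, V = truncate_rank(M0, r)
    for t in range(total_iters):
        residual = y - np.einsum("nij,ij->n", X, M)
        # A sub-gradient of the (1/n-scaled) absolute loss at M.
        G = -np.einsum("n,nij->ij", np.sign(residual), X) / n
        # Phase one: decaying stepsize. Phase two: constant stepsize.
        eta = eta0 * decay**t if t < phase_one_iters else eta_const
        # Step along the projected sub-gradient, then retract to rank r.
        M, U, V = truncate_rank(M - eta * tangent_project(G, U, V), r)
    return M

# Toy problem (all sizes and scales are assumptions).
rng = np.random.default_rng(0)
d1, d2, r, n = 15, 15, 2, 1500
U_star, _ = np.linalg.qr(rng.standard_normal((d1, r)))
V_star, _ = np.linalg.qr(rng.standard_normal((d2, r)))
M_star = (U_star * np.array([30.0, 20.0])) @ V_star.T
X = rng.standard_normal((n, d1, d2))
y = np.einsum("nij,ij->n", X, M_star) + rng.standard_t(df=2, size=n)

# Hypothetical warm start: a rank-r truncation of the perturbed truth.
# This is a stand-in, not the shrinkage-based initialization from the paper.
E = rng.standard_normal((d1, d2))
E *= 0.3 * np.linalg.norm(M_star) / np.linalg.norm(E)
M0, _, _ = truncate_rank(M_star + E, r)

M_hat = rsgrad_abs_loss(X, y, M0, r)
err_init = np.linalg.norm(M0 - M_star)
err_final = np.linalg.norm(M_hat - M_star)
print(f"initial error {err_init:.3f}, final error {err_final:.3f}")
```

The retraction here is a plain truncated SVD, which is affordable at this size; for large matrices one would likely exploit the fact that the update has rank at most 2r rather than decomposing the full matrix.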