We propose statistically robust and computationally efficient linear learning methods for the high-dimensional batch setting, where the number of features $d$ may exceed the sample size $n$. In a generic learning setting, we employ two algorithms, chosen according to whether or not the loss function is gradient-Lipschitz. We then instantiate our framework on several applications, including vanilla sparse, group-sparse and low-rank matrix recovery. For each application, this leads to efficient and robust learning algorithms that reach near-optimal estimation rates under heavy-tailed distributions and in the presence of outliers. For vanilla $s$-sparsity, we reach the $s\log (d)/n$ rate under heavy tails and $\eta$-corruption, at a computational cost comparable to that of non-robust analogs. We provide an efficient implementation of our algorithms in an open-source $\mathtt{Python}$ library called $\mathtt{linlearn}$, which we use to carry out numerical experiments that confirm our theoretical findings and compare our methods with other recent approaches from the literature.
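To make the vanilla sparse setting concrete, the following is a minimal NumPy sketch of robust sparse least squares. It is not the algorithms analyzed in the paper, nor the $\mathtt{linlearn}$ API: as generic stand-ins, it assumes a coordinate-wise trimmed mean of per-sample gradients for robustness and iterative hard thresholding for $s$-sparsity. All function names, parameters and default values are hypothetical and chosen only for illustration.

```python
import numpy as np


def trimmed_mean(values, trim_fraction):
    """Coordinate-wise trimmed mean along axis 0.

    Sorts each coordinate, drops floor(trim_fraction * n) points at each
    end, and averages the rest. This is an illustrative robust estimator,
    assumed here; it is not claimed to be the one used in the paper.
    """
    values = np.sort(np.asarray(values, dtype=float), axis=0)
    n = values.shape[0]
    k = int(np.floor(trim_fraction * n))  # points dropped at each end
    return values[k:n - k].mean(axis=0)


def hard_threshold(w, s):
    """Keep the s largest-magnitude entries of w and zero out the rest."""
    out = np.zeros_like(w)
    if s > 0:
        keep = np.argsort(np.abs(w))[-s:]
        out[keep] = w[keep]
    return out


def robust_sparse_least_squares(X, y, s, trim_fraction=0.1, step=0.1, n_iter=200):
    """Hypothetical sketch: robust gradient steps followed by hard thresholding."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        residuals = X @ w - y
        # Per-sample gradients of the squared loss, shape (n, d).
        per_sample_grads = residuals[:, None] * X
        # Robust aggregate in place of the plain empirical mean.
        g = trimmed_mean(per_sample_grads, trim_fraction)
        # Gradient step, then projection onto s-sparse vectors.
        w = hard_threshold(w - step * g, s)
    return w
```

Calling `robust_sparse_least_squares(X, y, s)` on a design matrix `X` of shape `(n, d)` returns a vector with at most `s` nonzero entries. The trimming level plays the role of a tolerance to a corrupted fraction of samples; how it should relate to $\eta$ is left open in this sketch.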