Group lasso is a commonly used regularization method in statistical learning in which parameters are eliminated from the model according to predefined groups. However, when the groups overlap, optimizing the group lasso penalized objective can be time-consuming on large-scale problems because of the non-separability induced by the overlapping groups. This bottleneck has seriously limited the application of overlapping group lasso regularization in many modern problems, such as gene pathway selection and graphical model estimation. In this paper, we propose a separable penalty as an approximation of the overlapping group lasso penalty. Thanks to this separability, regularized estimation with our penalty is substantially faster than with the overlapping group lasso, especially for large-scale and high-dimensional problems. We show that the penalty is the tightest separable relaxation of the overlapping group lasso norm within the family of $\ell_{q_1}/\ell_{q_2}$ norms. Moreover, we show that the estimator based on the proposed separable penalty is statistically equivalent to the one based on the overlapping group lasso penalty with respect to their error bounds and rate-optimal performance under the squared loss. We demonstrate the shorter computation time and the statistical equivalence of our method relative to the overlapping group lasso in simulation examples and in a cancer tumor classification problem based on gene expression and multiple gene pathways.
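To make the non-separability concrete, the following minimal sketch evaluates the standard overlapping group lasso penalty $\sum_g w_g \|\beta_g\|_2$ and lists the coordinates shared by more than one group. It illustrates the baseline penalty only, not the separable penalty proposed in the paper. The function names, the unit default weights, and the toy data are assumptions made for illustration.

```python
import math


def overlapping_group_lasso_penalty(beta, groups, weights=None):
    """Evaluate sum_g w_g * ||beta_g||_2 for possibly overlapping groups.

    `groups` is a list of index lists; `weights` defaults to 1 per group
    (an assumption; other weightings are common in practice).
    """
    if weights is None:
        weights = [1.0] * len(groups)
    total = 0.0
    for group, w in zip(groups, weights):
        total += w * math.sqrt(sum(beta[j] ** 2 for j in group))
    return total


def shared_coordinates(groups):
    """Return the sorted indices that belong to more than one group."""
    counts = {}
    for group in groups:
        for j in group:
            counts[j] = counts.get(j, 0) + 1
    return sorted(j for j, c in counts.items() if c > 1)


# Toy example: coordinate 1 belongs to both groups.
beta = [3.0, 4.0, 0.0]
groups = [[0, 1], [1, 2]]
penalty = overlapping_group_lasso_penalty(beta, groups)
shared = shared_coordinates(groups)
print(penalty, shared)  # 9.0 [1]
```

Because coordinate 1 enters two norm terms, the penalty cannot be written as a sum of terms over disjoint blocks of coordinates, so the groups cannot be updated independently as they can when the groups do not overlap.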