In high-dimensional generalized linear models, it is crucial to identify a sparse model that adequately accounts for response variation. Although best subset selection has been widely regarded as the Holy Grail of problems of this type, achieving either computational efficiency or statistical guarantees is challenging. In this article, we aim to surmount this obstacle with a fast algorithm that selects the best subset with high certainty. We propose and illustrate an algorithm for best subset recovery under regularity conditions. Under mild conditions, the computational complexity of our algorithm scales polynomially with sample size and dimension. Beyond establishing the statistical properties of our method, we show through extensive numerical experiments that it outperforms existing methods for variable selection and coefficient estimation. The runtime analysis shows that our implementation achieves approximately a fourfold speedup compared to popular variable selection toolkits such as glmnet and ncvreg.