We study the problem of exact support recovery for high-dimensional sparse linear regression under independent Gaussian design when the signals are weak, rare, and possibly heterogeneous. Under a suitable scaling of the sample size and signal sparsity, we fix the minimum signal magnitude at the information-theoretic optimal rate and investigate the asymptotic selection accuracy of best subset selection (BSS) and marginal screening (MS) procedures. We show that despite the ideal setup, somewhat surprisingly, marginal screening can fail to achieve exact recovery with probability converging to one in the presence of heterogeneous signals, whereas BSS enjoys model consistency whenever the minimum signal strength is above the information-theoretic threshold. To mitigate the computational intractability of BSS, we also propose an efficient two-stage algorithmic framework called ETS (Estimate Then Screen), comprising an estimation step and a gradient coordinate screening step. Under the same scaling assumption on sample size and sparsity, we show that ETS achieves model consistency under the same information-theoretic optimal requirement on the minimum signal strength as BSS. Finally, we present a simulation study comparing ETS with LASSO and marginal screening. The numerical results agree with our asymptotic theory even for realistic values of the sample size, dimension, and sparsity.
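To make the two screening ideas concrete, the following is a minimal Python sketch, not the authors' implementation. Marginal screening is written in its usual form, ranking coordinates by |X_j^T y|. The two-stage function only illustrates the "estimate, then screen" pattern: the ridge initial estimator, the unit-step gradient correction, the ridge parameter, and all function names are assumptions made here for illustration, since the abstract does not specify the estimation or screening steps.

```python
import numpy as np


def marginal_screening(X, y, s):
    """Return the s coordinates with the largest marginal score |X_j^T y|."""
    scores = np.abs(X.T @ y)
    return np.sort(np.argsort(scores)[-s:])


def estimate_then_screen(X, y, s, ridge=1e-3):
    """Hypothetical two-stage sketch: initial estimate, then gradient-based screening.

    Both stages are illustrative assumptions, not the procedure from the paper.
    """
    n, p = X.shape
    # Stage 1 (assumption): a ridge estimate stands in for the estimation step.
    beta0 = np.linalg.solve(X.T @ X + ridge * n * np.eye(p), X.T @ y)
    # Stage 2 (assumption): take one unit-step gradient update of the
    # least-squares loss at beta0 and rank coordinates by absolute value.
    grad_step = X.T @ (y - X @ beta0) / n
    scores = np.abs(beta0 + grad_step)
    return np.sort(np.argsort(scores)[-s:])


if __name__ == "__main__":
    # Small synthetic example with heterogeneous signal magnitudes.
    rng = np.random.default_rng(1)
    n, p, s = 300, 100, 3
    X = rng.standard_normal((n, p))        # independent Gaussian design
    beta = np.zeros(p)
    beta[:s] = [1.0, 2.0, 4.0]             # heterogeneous nonzero coefficients
    y = X @ beta + rng.standard_normal(n)
    print("marginal screening:", marginal_screening(X, y, s))
    print("estimate then screen:", estimate_then_screen(X, y, s))
```

Both functions take the sparsity level `s` as known, which matches the exact-recovery setting in which a procedure is judged by whether its selected set equals the true support.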