Sparse linear regression is a central problem in high-dimensional statistics. We study the correlated random design setting, where the covariates are drawn from a multivariate Gaussian $N(0,\Sigma)$, and we seek an estimator with small excess risk. If the true signal is $t$-sparse, it is information-theoretically possible to achieve strong recovery guarantees with only $O(t\log n)$ samples. However, computationally efficient algorithms have sample complexity linear in (some variant of) the condition number of $\Sigma$. Classical algorithms such as the Lasso can require significantly more samples than necessary even if there is only a single sparse approximate dependency among the covariates. We provide a polynomial-time algorithm that, given $\Sigma$, automatically adapts the Lasso to tolerate a small number of approximate dependencies. In particular, we achieve near-optimal sample complexity when the sparsity is constant and $\Sigma$ has few ``outlier'' eigenvalues. Our algorithm fits into a broader framework of feature adaptation for sparse linear regression with ill-conditioned covariates. With this framework, we additionally provide the first polynomial-factor improvement over brute-force search for constant sparsity $t$ and arbitrary covariance $\Sigma$.
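For concreteness, the setting can be sketched as follows. The symbols $m$ (number of samples), $w^*$ (true signal), $\hat w$ (estimator), and $\sigma$ (noise level) are notational assumptions not fixed by the text above, as is the reading of $n$ as the number of covariates; the additive Gaussian noise model is the standard one and is likewise assumed here. One observes $m$ independent samples
\[
y_i = \langle x_i, w^* \rangle + \xi_i, \qquad x_i \sim N(0,\Sigma), \quad \xi_i \sim N(0,\sigma^2), \quad \|w^*\|_0 \le t,
\]
with $x_i \in \mathbb{R}^n$. The excess risk of an estimator $\hat w$, measured on a fresh covariate $x \sim N(0,\Sigma)$, is then
\[
\mathbb{E}_x\big[(\langle x,\hat w\rangle - \langle x,w^*\rangle)^2\big] = (\hat w - w^*)^\top \Sigma\,(\hat w - w^*).
\]
Under this reading, the $O(t\log n)$ benchmark is a bound on the number of samples $m$ that suffices for small excess risk.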