Sparse Linear Regression and Lattice Problems

Sparse linear regression (SLR) is a well-studied problem in statistics: given a design matrix $X\in\mathbb{R}^{m\times n}$ and a response vector $y=X\theta^*+w$, where $\theta^*$ is $k$-sparse (that is, $\|\theta^*\|_0\leq k$) and $w$ is small, arbitrary noise, the goal is to find a $k$-sparse $\widehat{\theta} \in \mathbb{R}^n$ that minimizes the mean squared prediction error $\frac{1}{m}\|X\widehat{\theta}-X\theta^*\|^2_2$. While $\ell_1$-relaxation methods such as basis pursuit, Lasso, and the Dantzig selector solve SLR when the design matrix is well-conditioned, no general algorithm is known, nor is there any formal evidence of hardness in an average-case setting with respect to all efficient algorithms. We give evidence of average-case hardness of SLR with respect to all efficient algorithms, assuming the worst-case hardness of lattice problems. Specifically, we give an instance-by-instance reduction from a variant of the bounded distance decoding (BDD) problem on lattices to SLR, in which the condition number of the lattice basis defining the BDD instance is directly related to the restricted eigenvalue condition of the design matrix; this condition characterizes some of the classical statistical-computational gaps for sparse linear regression. Combined with worst-case to average-case reductions for lattice problems, this shows hardness for a distribution of SLR instances; although the design matrices are ill-conditioned, the resulting SLR instances are in the identifiable regime. Furthermore, for well-conditioned (essentially) isotropic Gaussian design matrices, where Lasso is known to behave well in the identifiable regime, we show that outputting any good solution is hard in the unidentifiable regime, where there are many solutions, assuming the worst-case hardness of standard and well-studied lattice problems.
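To make the SLR objective defined above concrete, the following is a minimal Python sketch on a toy instance. It is not the reduction or any algorithm from this work: it builds a small Gaussian design matrix, a 2-sparse $\theta^*$, and bounded noise (all sizes, the seed, and the noise scale are assumptions chosen for illustration), then finds a $k$-sparse $\widehat{\theta}$ by exhaustive search over supports and evaluates the mean squared prediction error. The helper names `matvec`, `solve`, `best_subset`, and `prediction_error` are hypothetical. The search visits all $\binom{n}{k}$ supports, so it is only feasible for tiny instances, which is the gap that motivates asking about efficient algorithms.

```python
import itertools
import random


def matvec(X, theta):
    """Return X @ theta for X given as a list of rows."""
    return [sum(x * t for x, t in zip(row, theta)) for row in X]


def prediction_error(X, theta_hat, theta_star):
    """Mean squared prediction error (1/m) * ||X theta_hat - X theta_star||_2^2."""
    diff = [a - b for a, b in zip(matvec(X, theta_hat), matvec(X, theta_star))]
    return sum(d * d for d in diff) / len(X)


def solve(A, b):
    """Solve a small square system A z = b by Gaussian elimination
    with partial pivoting (assumes A is nonsingular)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z


def best_subset(X, y, k):
    """Exhaustive search: least squares on every support of size k,
    keeping the fit with the smallest residual. Exponential in k."""
    m, n = len(X), len(X[0])
    best_theta, best_res = None, float("inf")
    for S in itertools.combinations(range(n), k):
        # Normal equations restricted to the columns in S.
        G = [[sum(X[i][a] * X[i][b] for i in range(m)) for b in S] for a in S]
        c = [sum(X[i][a] * y[i] for i in range(m)) for a in S]
        z = solve(G, c)
        theta = [0.0] * n
        for a, v in zip(S, z):
            theta[a] = v
        res = sum((yi - pi) ** 2 for yi, pi in zip(y, matvec(X, theta)))
        if res < best_res:
            best_theta, best_res = theta, res
    return best_theta


# Toy instance (all values are illustrative assumptions).
rng = random.Random(0)
m, n, k = 8, 6, 2
X = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
theta_star = [0.0, 1.5, 0.0, 0.0, -2.0, 0.0]      # k-sparse ground truth
w = [rng.uniform(-0.01, 0.01) for _ in range(m)]  # small bounded noise
y = [p + wi for p, wi in zip(matvec(X, theta_star), w)]

theta_hat = best_subset(X, y, k)
err = prediction_error(X, theta_hat, theta_star)
print("prediction error:", err)
```

Because the best $k$-sparse fit has residual at most $\|w\|_2$, the triangle inequality gives $\|X\widehat{\theta}-X\theta^*\|_2 \leq 2\|w\|_2$, so the prediction error of this brute-force solution is at most $4\|w\|_2^2/m$ on any instance.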
