Statistical-Computational Tradeoffs in Mixed Sparse Linear Regression

We consider the problem of mixed sparse linear regression with two components, where two real $k$-sparse signals $\beta_1, \beta_2$ are to be recovered from $n$ unlabelled noisy linear measurements. The sparsity is allowed to be sublinear in the dimension $p$, and the additive noise is assumed to be independent Gaussian with variance $\sigma^2$. Prior work has shown that the problem suffers from a $\frac{k}{\mathrm{SNR}^2}$-to-$\frac{k^2}{\mathrm{SNR}^2}$ statistical-to-computational gap, resembling other computationally challenging high-dimensional inference problems such as Sparse PCA and Robust Sparse Mean Estimation; here $\mathrm{SNR}$ is the signal-to-noise ratio. We establish the existence of a more extensive computational barrier for this problem through the method of low-degree polynomials, but show that the problem is computationally hard only in a very narrow symmetric parameter regime. In this hard regime, we identify a smooth information-computation tradeoff between the sample complexity $n$ and the runtime of any randomized algorithm. Via a simple reduction, this provides novel rigorous evidence for the existence of a computational barrier to solving exact support recovery in sparse phase retrieval with sample complexity $n = \tilde{o}(k^2)$. Our second contribution is to analyze a simple thresholding algorithm which, outside of the narrow regime where the problem is hard, solves the associated mixed regression detection problem in $O(np)$ time with the square root of the number of samples, and matches the sample complexity required for (non-mixed) sparse linear regression; this allows the recovery problem to be subsequently solved by state-of-the-art techniques from the dense case. As a special case of our results, we show that this simple algorithm is order-optimal among a large family of algorithms in solving exact signed support recovery in sparse linear regression.
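
To make the setting concrete, the following is a minimal sketch of the two-component model and of a generic marginal-correlation thresholding rule that runs in $O(np)$ time. It is an illustration under stated assumptions, not the algorithm analyzed in the paper: the abstract does not specify the statistic or the threshold, so the function names, the mixing proportion `phi`, the i.i.d. standard Gaussian design, and the threshold `tau` are all hypothetical choices. With this design, the population value of the statistic at coordinate $j$ is $\phi\,\beta_{1,j} + (1-\phi)\,\beta_{2,j}$, which vanishes when $\phi = 1/2$ and $\beta_1 = -\beta_2$; this is one way to see why a first-moment statistic of this kind carries no signal in a symmetric configuration.

```python
import random

def sample_mixed_slr(n, p, beta1, beta2, sigma, phi=0.5, seed=0):
    """Draw n unlabelled samples y_i = <x_i, beta_{z_i}> + sigma * w_i.

    Assumptions (not from the source): i.i.d. N(0, 1) design entries and
    a hidden label z_i that picks beta1 with probability phi.
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        beta = beta1 if rng.random() < phi else beta2  # hidden label, never observed
        signal = sum(xj * bj for xj, bj in zip(x, beta))
        X.append(x)
        y.append(signal + sigma * rng.gauss(0.0, 1.0))
    return X, y

def correlation_threshold(X, y, tau):
    """Return the coordinates j with |(1/n) * sum_i y_i * x_ij| > tau.

    One pass over the n-by-p design, so the cost is O(np).
    The threshold tau is a hypothetical tuning parameter.
    """
    n, p = len(X), len(X[0])
    corr = [0.0] * p
    for xi, yi in zip(X, y):
        for j in range(p):
            corr[j] += yi * xi[j]
    return sorted(j for j in range(p) if abs(corr[j] / n) > tau)

# Illustrative parameters: p = 40, k = 3, disjoint supports.
p = 40
beta1 = [1.0] * 3 + [0.0] * (p - 3)
beta2 = [0.0] * 3 + [1.0] * 3 + [0.0] * (p - 6)
X, y = sample_mixed_slr(n=3000, p=p, beta1=beta1, beta2=beta2, sigma=0.5)
found = correlation_threshold(X, y, tau=0.25)

# Symmetric configuration: beta2 = -beta1 with phi = 1/2.
neg_beta1 = [-b for b in beta1]
Xs, ys = sample_mixed_slr(n=3000, p=p, beta1=beta1, beta2=neg_beta1, sigma=0.5, seed=1)
found_symmetric = correlation_threshold(Xs, ys, tau=0.25)
```

In the first configuration the statistic concentrates around $0.5$ on the union of the two supports and around $0$ elsewhere, so `found` should contain the six support coordinates. In the symmetric configuration every coordinate concentrates around $0$, so `found_symmetric` should be empty even though the signals are far from zero.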
