We study the task of list-decodable linear regression using batches. A batch is called clean if it consists of i.i.d. samples from an unknown linear regression distribution. For a parameter $\alpha \in (0, 1/2)$, an unknown $\alpha$-fraction of the batches are clean and no assumptions are made on the remaining ones. The goal is to output a small list of vectors at least one of which is close to the true regressor vector in $\ell_2$-norm. [DJKS23] gave an efficient algorithm, under natural distributional assumptions, with the following guarantee. Assuming that the batch size $n$ satisfies $n \geq \tilde{\Omega}(\alpha^{-1})$ and the number of batches is $m = \mathrm{poly}(d, n, 1/\alpha)$, their algorithm runs in polynomial time and outputs a list of $O(1/\alpha^2)$ vectors at least one of which is $\tilde{O}(\alpha^{-1/2}/\sqrt{n})$ close to the target regressor. Here we design a new polynomial-time algorithm with significantly stronger guarantees under the assumption that the low-degree moments of the covariate distribution are Sum-of-Squares (SoS) certifiably bounded. Specifically, for any constant $\delta>0$, as long as the batch size is $n \geq \Omega_{\delta}(\alpha^{-\delta})$ and the degree-$\Theta(1/\delta)$ moments of the covariates are SoS certifiably bounded, our algorithm uses $m = \mathrm{poly}((dn)^{1/\delta}, 1/\alpha)$ batches, runs in polynomial time, and outputs an $O(1/\alpha)$-sized list of vectors one of which is $O(\alpha^{-\delta/2}/\sqrt{n})$ close to the target. That is, our algorithm achieves a substantially smaller minimum batch size and final error, while achieving the optimal list size. Our approach uses higher-order moment information by carefully combining the SoS paradigm interleaved with an iterative method and a novel list-pruning procedure. In the process, we give an SoS proof of the Marcinkiewicz-Zygmund inequality that may be of broader applicability.
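For reference, the classical Marcinkiewicz-Zygmund inequality mentioned in the last sentence bounds the $p$-th moment of a sum of independent, mean-zero random variables by the $p/2$-th moment of the sum of their squares; the contribution here is a version of (the upper direction of) this bound whose proof is certifiable in the SoS proof system. In its standard probabilistic form it reads:

```latex
% Classical Marcinkiewicz-Zygmund inequality: for independent,
% mean-zero random variables X_1, ..., X_n with finite p-th moments
% (p >= 1), there exist constants A_p, B_p > 0 depending only on p
% such that
\[
  A_p \, \mathbb{E}\!\left[\Bigl(\sum_{i=1}^{n} X_i^2\Bigr)^{p/2}\right]
  \;\le\;
  \mathbb{E}\!\left[\Bigl|\sum_{i=1}^{n} X_i\Bigr|^{p}\right]
  \;\le\;
  B_p \, \mathbb{E}\!\left[\Bigl(\sum_{i=1}^{n} X_i^2\Bigr)^{p/2}\right].
\]
```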