We study the problem of list-decodable sparse mean estimation. Specifically, for a parameter $\alpha \in (0, 1/2)$, we are given $m$ points in $\mathbb{R}^n$, $\lfloor \alpha m \rfloor$ of which are i.i.d. samples from a distribution $D$ with unknown $k$-sparse mean $\mu$. No assumptions are made on the remaining points, which form the majority of the dataset. The goal is to return a small list of candidates containing a vector $\widehat \mu$ such that $\| \widehat \mu - \mu \|_2$ is small. Prior work studied list-decodable mean estimation in the dense setting. In this work, we develop a novel, conceptually simpler technique for list-decodable mean estimation. As the main application of our approach, we provide the first sample- and computationally efficient algorithm for list-decodable sparse mean estimation. In particular, for distributions with "certifiably bounded" $t$-th moments in $k$-sparse directions and sufficiently light tails, our algorithm achieves error $(1/\alpha)^{O(1/t)}$ with sample complexity $m = (k\log(n))^{O(t)}/\alpha$ and running time $\mathrm{poly}(mn^t)$. For the special case of Gaussian inliers, our algorithm achieves the optimal error guarantee of $\Theta (\sqrt{\log(1/\alpha)})$ with quasi-polynomial sample and computational complexity. We complement our upper bounds with nearly matching statistical query and low-degree polynomial testing lower bounds.
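To make the data model concrete, the following is a minimal sketch of how a toy instance of the problem could be generated and how a candidate list would be scored. It is not the algorithm of this work. The helper names `make_instance`, `l2_distance`, and `best_list_error` are hypothetical, and the choices of identity-covariance Gaussian inliers and a single decoy cluster of outliers are assumptions made purely for illustration; the problem statement itself places no restriction on the non-inlier points.

```python
import math
import random


def make_instance(m, n, k, alpha, seed=0):
    """Generate a toy instance (hypothetical helper, not from the paper).

    Returns (points, mu, inlier_count), where `points` has m vectors in R^n
    and floor(alpha * m) of them are inliers with k-sparse mean `mu`.
    """
    if not 0 < alpha < 0.5:
        raise ValueError("alpha must lie in (0, 1/2)")
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    rng = random.Random(seed)

    # Unknown k-sparse mean: exactly k coordinates are nonzero.
    mu = [0.0] * n
    for i in rng.sample(range(n), k):
        mu[i] = rng.uniform(1.0, 2.0)

    # Inliers: i.i.d. samples from N(mu, I). The Gaussian choice is an assumption.
    inlier_count = math.floor(alpha * m)
    inliers = [[rng.gauss(mu[j], 1.0) for j in range(n)]
               for _ in range(inlier_count)]

    # The remaining points are unconstrained. As one arbitrary example,
    # place them in a decoy cluster around a different k-sparse vector.
    decoy = [0.0] * n
    for i in rng.sample(range(n), k):
        decoy[i] = rng.uniform(-2.0, -1.0)
    outliers = [[rng.gauss(decoy[j], 1.0) for j in range(n)]
                for _ in range(m - inlier_count)]

    points = inliers + outliers
    rng.shuffle(points)  # the learner does not know which points are inliers
    return points, mu, inlier_count


def l2_distance(u, v):
    """Euclidean distance between two vectors given as lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def best_list_error(candidates, mu):
    """Error of a candidate list: the smallest l2 distance to the true mean."""
    return min(l2_distance(c, mu) for c in candidates)
```

Because the inliers are a minority, a single estimate cannot be guaranteed to be accurate: the decoy cluster is just as plausible a source of the mean. This is why success is measured by `best_list_error`, which only requires that some candidate in the returned list is close to $\mu$.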