We study the problem of recovering Gaussian data under adversarial corruptions when the noise is low-rank and the corruptions act at the coordinate level. Concretely, we assume that the Gaussian noise lies in an unknown $k$-dimensional subspace $U \subseteq \mathbb{R}^d$, and that $s$ randomly chosen coordinates of each data point fall under the control of an adversary. This setting models learning from high-dimensional yet structured data that are transmitted through a highly noisy channel, so that the data points are unlikely to be entirely clean. Our main result is an efficient algorithm that, when $ks^2 = O(d)$, recovers every single data point up to a nearly optimal $\ell_1$ error of $\tilde O(ks/d)$ in expectation. At the core of our proof is a new analysis of the well-known Basis Pursuit (BP) method for recovering a sparse signal, which is known to succeed under additional assumptions (e.g., incoherence or the restricted isometry property) on the underlying subspace $U$. In contrast, we take a novel approach by studying a natural combinatorial problem, and show that, over the randomness in the support of the sparse signal, a high-probability error bound is possible even if the subspace $U$ is arbitrary.
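
For orientation, one standard way to write Basis Pursuit in a setting of this kind is sketched below. The notation is assumed for illustration and is not taken from the paper: $y$ denotes an observed data point, $x \in U$ the clean point (centered so that the mean is zero), $e$ the adversarial corruption supported on at most $s$ coordinates, and $U$ is treated as already known or estimated.

$$
y = x + e, \qquad x \in U, \quad \|e\|_0 \le s, \qquad \hat{x} \in \arg\min_{z \in U} \|y - z\|_1 .
$$

Given a basis matrix $B \in \mathbb{R}^{d \times k}$ for $U$ (again an assumed symbol), the minimization can be written as the linear program $\min_{c,\,t} \sum_{i=1}^{d} t_i$ subject to $-t \le y - Bc \le t$, which is one reason the method is computationally efficient.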