In this paper, we derive a sample complexity bound for learning a simplex from noisy samples. We assume a dataset of $n$ i.i.d. samples drawn from the uniform distribution over an unknown simplex in $\mathbb{R}^K$, where each sample is corrupted by multivariate additive Gaussian noise of arbitrary magnitude. We prove the existence of an algorithm that, with high probability, outputs a simplex within $\ell_2$ distance at most $\varepsilon$ of the true simplex (for any $\varepsilon>0$). We also show theoretically that, to achieve this bound, it is sufficient to have $n\ge\left(K^2/\varepsilon^2\right)e^{\Omega\left(K/\mathrm{SNR}^2\right)}$ samples, where $\mathrm{SNR}$ stands for the signal-to-noise ratio. This result solves an important open problem and shows that, as long as $\mathrm{SNR}\ge\Omega\left(K^{1/2}\right)$, the sample complexity of the noisy regime is of the same order as that of the noiseless case. Our proofs combine the so-called sample compression technique of \citep{ashtiani2018nearly}, mathematical tools from high-dimensional geometry, and Fourier analysis. In particular, we propose a general Fourier-based technique for recovering a broader class of distribution families from additive Gaussian noise, which can be further used in a variety of other related problems.
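To make the data model concrete, the following is a minimal sketch of how such a noisy dataset could be generated. It illustrates only the sampling assumption, not the learning algorithm whose existence is proved. The function name `sample_noisy_simplex`, the choice of the standard simplex as the ground truth, and the use of isotropic noise with standard deviation `sigma` are assumptions made for illustration; the abstract does not fix the noise covariance.

```python
import numpy as np

def sample_noisy_simplex(vertices, n, sigma, rng=None):
    """Draw n points uniformly from the simplex spanned by `vertices`
    (shape (K+1, K)) and add Gaussian noise.

    Assumption: the noise is isotropic with standard deviation `sigma`.
    """
    rng = np.random.default_rng(rng)
    vertices = np.asarray(vertices, dtype=float)
    # Dirichlet(1, ..., 1) weights are uniform over the probability simplex,
    # and an affine map preserves uniformity, so the convex combinations
    # below are uniform over the simplex spanned by `vertices`.
    weights = rng.dirichlet(np.ones(vertices.shape[0]), size=n)
    clean = weights @ vertices
    noise = sigma * rng.standard_normal(clean.shape)
    return clean + noise, clean

# Hypothetical ground truth: the standard simplex in R^K (origin plus unit vectors).
K = 3
vertices = np.vstack([np.zeros(K), np.eye(K)])
noisy, clean = sample_noisy_simplex(vertices, n=1000, sigma=0.1, rng=0)
```

A learner in this setting would observe only `noisy`; the array `clean` is returned here purely so the effect of the noise can be inspected.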