We study the basic statistical problem of testing whether normally distributed $n$-dimensional data has been truncated, i.e., altered by retaining only the points that lie in some unknown truncation set $S \subseteq \mathbb{R}^n$. Our main algorithmic results are as follows: (1) we give a computationally efficient $O(n)$-sample algorithm that can distinguish the standard normal distribution $N(0,I_n)$ from $N(0,I_n)$ conditioned on an unknown and arbitrary convex set $S$; (2) we give a different computationally efficient $O(n)$-sample algorithm that can distinguish $N(0,I_n)$ from $N(0,I_n)$ conditioned on an unknown and arbitrary mixture of symmetric convex sets. These results stand in sharp contrast with known results for learning or testing convex bodies with respect to the normal distribution, or for learning convex-truncated normal distributions, where state-of-the-art algorithms require essentially $n^{\sqrt{n}}$ samples. An easy argument shows that no finite number of samples suffices to distinguish $N(0,I_n)$ from $N(0,I_n)$ conditioned on an unknown and arbitrary mixture of general (not necessarily symmetric) convex sets, so no common generalization of results (1) and (2) is possible. We also prove that any algorithm (computationally efficient or otherwise) that can distinguish $N(0,I_n)$ from $N(0,I_n)$ conditioned on an unknown symmetric convex set must use $\Omega(n)$ samples. This shows that the sample complexity of each of our algorithms is optimal up to a constant factor.
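To make the testing problem concrete, the following is a minimal sketch of the setup, not the algorithms referred to in the abstract. It draws samples from $N(0,I_n)$ and from $N(0,I_n)$ conditioned on a symmetric convex set (here, as an assumption, the centered ball of squared radius $n$), and compares the empirical mean squared norm against its untruncated expectation $n$. The statistic, the names `looks_truncated` and `demo`, and the parameters `m` and `margin` are all hypothetical choices made for illustration.

```python
import random


def sample_standard_normal(n, rng):
    """Draw one point from N(0, I_n) as a list of n independent N(0, 1) coordinates."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def sample_truncated(n, rng, in_set):
    """Draw one point from N(0, I_n) conditioned on in_set(x), via rejection sampling."""
    while True:
        x = sample_standard_normal(n, rng)
        if in_set(x):
            return x


def mean_squared_norm(samples):
    """Average of ||x||^2 over the samples; its expectation is n under N(0, I_n)."""
    return sum(sum(c * c for c in x) for x in samples) / len(samples)


def looks_truncated(samples, n, margin):
    """Illustrative decision rule (an assumption): flag truncation when the
    average squared norm falls below n by more than `margin`."""
    return mean_squared_norm(samples) < n - margin


def demo(seed=0, n=20, m=400, margin=2.0):
    """Run the rule on untruncated samples and on ball-truncated samples."""
    rng = random.Random(seed)
    # Hypothetical truncation set: the centered ball {x : ||x||^2 <= n},
    # which is symmetric and convex.
    in_ball = lambda x: sum(c * c for c in x) <= n
    untruncated = [sample_standard_normal(n, rng) for _ in range(m)]
    truncated = [sample_truncated(n, rng, in_ball) for _ in range(m)]
    return (
        looks_truncated(untruncated, n, margin),
        looks_truncated(truncated, n, margin),
    )
```

Calling `demo()` returns a pair of booleans: the decision on the untruncated samples, then the decision on the truncated ones. The ball is chosen only because rejection sampling accepts a constant fraction of draws; the sketch makes no claim about sample complexity or about truncation sets that remove little probability mass.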