We consider the following basic, and very broad, statistical problem: Given a known high-dimensional distribution ${\cal D}$ over $\mathbb{R}^n$ and a collection of data points in $\mathbb{R}^n$, distinguish between the two possibilities that (i) the data was drawn from ${\cal D}$, versus (ii) the data was drawn from ${\cal D}|_S$, i.e. from ${\cal D}$ subject to truncation by an unknown truncation set $S \subseteq \mathbb{R}^n$. We study this problem in the setting where ${\cal D}$ is a high-dimensional i.i.d. product distribution and $S$ is an unknown degree-$d$ polynomial threshold function (one of the most well-studied types of Boolean-valued function over $\mathbb{R}^n$). Our main results are an efficient algorithm when ${\cal D}$ is a hypercontractive distribution, and a matching lower bound:

$\bullet$ For any constant $d$, we give a polynomial-time algorithm which successfully distinguishes ${\cal D}$ from ${\cal D}|_S$ using $O(n^{d/2})$ samples (subject to mild technical conditions on ${\cal D}$ and $S$);

$\bullet$ Even for the simplest case of ${\cal D}$ being the uniform distribution over $\{+1, -1\}^n$, we show that for any constant $d$, any distinguishing algorithm for degree-$d$ polynomial threshold functions must use $\Omega(n^{d/2})$ samples.
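To make the distinguishing task concrete, the following is a minimal sketch for the simplest setting only: $d = 1$, ${\cal D}$ uniform over $\{+1,-1\}^n$, and $S$ a halfspace. It is an illustration under stated assumptions and is not taken from the paper. The statistic (the average inner product over pairs of distinct samples), the function names, the truncation set $\{x : \sum_i x_i \ge 0\}$, the sample size, and the threshold are all hypothetical choices made here. The idea is that under ${\cal D}$ distinct samples are uncorrelated, so the statistic concentrates near $0$, whereas truncation by a halfspace shifts the mean of the data and pushes the statistic upward.

```python
import random


def sample_uniform(n, rng):
    """Draw one point from the uniform distribution over {+1, -1}^n."""
    return [rng.choice((-1, 1)) for _ in range(n)]


def sample_truncated(n, rng, inside):
    """Draw one point from the uniform distribution conditioned on inside(x),
    using rejection sampling (assumes the set has non-negligible mass)."""
    while True:
        x = sample_uniform(n, rng)
        if inside(x):
            return x


def degree_one_statistic(samples):
    """Average of <x_j, x_k> over all ordered pairs j != k.

    Uses the identity
        sum_{j != k} <x_j, x_k> = ||sum_j x_j||^2 - sum_j ||x_j||^2,
    where ||x_j||^2 = n for every point of {+1, -1}^n.
    """
    m = len(samples)
    n = len(samples[0])
    col_sums = [sum(x[i] for x in samples) for i in range(n)]
    sq_norm = sum(s * s for s in col_sums)
    return (sq_norm - m * n) / (m * (m - 1))


def looks_truncated(samples, threshold):
    """Hypothetical decision rule: report truncation if the statistic is large."""
    return degree_one_statistic(samples) > threshold


def demo(seed=0, n=100, m=400):
    """Return the statistic on untruncated data and on halfspace-truncated data."""
    rng = random.Random(seed)
    halfspace = lambda x: sum(x) >= 0  # assumed truncation set, for illustration
    null_samples = [sample_uniform(n, rng) for _ in range(m)]
    trunc_samples = [sample_truncated(n, rng, halfspace) for _ in range(m)]
    return degree_one_statistic(null_samples), degree_one_statistic(trunc_samples)


if __name__ == "__main__":
    null_stat, trunc_stat = demo()
    print("untruncated:", round(null_stat, 3))
    print("truncated:  ", round(trunc_stat, 3))
```

The sample size and any threshold used with `looks_truncated` are picked generously so that the two cases separate clearly in a small experiment; they make no claim about the constants or the technical conditions behind the $O(n^{d/2})$ bound stated above.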