Testing Closeness of Multivariate Distributions via Ramsey Theory

We investigate the statistical task of closeness (or equivalence) testing for multidimensional distributions. Specifically, given sample access to two unknown distributions $\mathbf p, \mathbf q$ on $\mathbb R^d$, we want to distinguish between the case that $\mathbf p=\mathbf q$ and the case that $\|\mathbf p-\mathbf q\|_{A_k} > \epsilon$, where $\|\mathbf p-\mathbf q\|_{A_k}$ denotes the generalized ${A}_k$ distance between $\mathbf p$ and $\mathbf q$, which measures the maximum discrepancy between the distributions over any collection of $k$ disjoint, axis-aligned rectangles. Our main result is the first closeness tester for this problem with {\em sub-learning} sample complexity in any fixed dimension, together with a nearly matching sample complexity lower bound. In more detail, we provide a computationally efficient closeness tester with sample complexity $O\left((k^{6/7}/ \mathrm{poly}_d(\epsilon)) \log^d(k)\right)$. On the lower bound side, we establish a qualitatively matching sample complexity lower bound of $\Omega(k^{6/7}/\mathrm{poly}(\epsilon))$, even for $d=2$. These sample complexity bounds are surprising because the sample complexity of the problem in the univariate setting is $\Theta(k^{4/5}/\mathrm{poly}(\epsilon))$. This has the interesting consequence that the jump from one to two dimensions leads to a substantial increase in sample complexity, while further increases in dimension do not. As a corollary of our general $A_k$ tester, we obtain $d_{\mathrm{TV}}$-closeness testers for pairs of $k$-histograms on $\mathbb R^d$ over a common unknown partition, and for pairs of uniform distributions supported on the union of $k$ unknown disjoint axis-aligned rectangles. Both our algorithm and our lower bound make essential use of tools from Ramsey theory.
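
For concreteness, the verbal description of the $A_k$ distance above can be written as a formula. The display below is a sketch of one natural formalization, not a definition quoted from the paper: the symbols $R_1,\dots,R_k$ and the choice to sum absolute discrepancies without a normalizing constant are assumptions, and the paper's own definition may differ by a constant factor.

$$
\|\mathbf p-\mathbf q\|_{A_k} \;=\; \sup_{R_1,\dots,R_k} \; \sum_{i=1}^{k} \bigl|\mathbf p(R_i)-\mathbf q(R_i)\bigr|,
$$

where the supremum ranges over all collections of $k$ pairwise disjoint axis-aligned rectangles $R_1,\dots,R_k \subseteq \mathbb R^d$. In terms of the exponent of $k$, the bounds stated above compare as $4/5 = 0.8$ in the univariate setting versus $6/7 \approx 0.857$ for every fixed $d \ge 2$.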
