Efficient Statistics With Unknown Truncation, Polynomial Time Algorithms, Beyond Gaussians

We study the estimation of distributional parameters when samples are observed only if they fall in some unknown set $S \subseteq \mathbb{R}^d$. Kontonis, Tzamos, and Zampetakis (FOCS'19) gave a $d^{\mathrm{poly}(1/\varepsilon)}$-time algorithm for finding $\varepsilon$-accurate parameters in the special case of Gaussian distributions with diagonal covariance matrices. Recently, Diakonikolas, Kane, Pittas, and Zarifis (COLT'24) showed that this exponential dependence on $1/\varepsilon$ is necessary even when $S$ belongs to some well-behaved classes. These works leave open the following problems, which we address in this work: Can we estimate the parameters of any Gaussian, or even extend beyond Gaussians? Can we design $\mathrm{poly}(d/\varepsilon)$-time algorithms when $S$ is a simple set such as a halfspace? We make progress on both questions with the following results:

1. Toward the first question, we give a $d^{\mathrm{poly}(\ell/\varepsilon)}$-time algorithm for any exponential family that satisfies some structural assumptions and any unknown set $S$ that is $\varepsilon$-approximable by degree-$\ell$ polynomials. This result has two important applications: 1a) the first algorithm for estimating arbitrary Gaussian distributions from samples truncated to an unknown $S$; and 1b) the first algorithm for linear regression with unknown truncation and Gaussian features.

2. To address the second question, we provide an algorithm with runtime $\mathrm{poly}(d/\varepsilon)$ that works for a set of exponential families (containing all Gaussians) when $S$ is a halfspace or an axis-aligned rectangle.

Along the way, we develop tools that may be of independent interest, including a reduction from PAC learning with positive and unlabeled samples to PAC learning with positive and negative samples that is robust to certain covariate shifts.
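To make the sampling model concrete, the following minimal sketch simulates truncated observations: points are drawn from a Gaussian and kept only if they land in a set $S$. It illustrates the problem setup only and is not the estimation algorithm of this work. The helper `sample_truncated_gaussian`, the particular halfspace, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def sample_truncated_gaussian(mean, cov, in_set, n, rng, batch=4096):
    """Rejection-sample n points from N(mean, cov) conditioned on in_set.

    `in_set` maps an (m, d) array to a boolean mask of length m.
    Illustrative helper (an assumption, not from the paper); it loops
    forever if the set has zero probability mass.
    """
    kept, total = [], 0
    while total < n:
        x = rng.multivariate_normal(mean, cov, size=batch)
        x = x[in_set(x)]          # the learner only ever sees these points
        kept.append(x)
        total += len(x)
    return np.concatenate(kept)[:n]

rng = np.random.default_rng(0)
d = 3
true_mean = np.zeros(d)
true_cov = np.eye(d)

# Hypothetical truncation set: the halfspace {x : <w, x> >= b}.
w = np.array([1.0, 0.0, 0.0])
b = 0.0
halfspace = lambda x: x @ w >= b

samples = sample_truncated_gaussian(true_mean, true_cov, halfspace, 20000, rng)

# The empirical mean of the observed samples is biased away from the true
# mean along w, which is why truncation-aware estimators are needed.
naive_mean = samples.mean(axis=0)
print("true mean: ", true_mean)
print("naive mean:", naive_mean)
```

For this particular halfspace the first coordinate of the observed samples follows a half-normal distribution, so the naive mean along $w$ concentrates near $\sqrt{2/\pi} \approx 0.8$ rather than the true value $0$. In the setting studied here, the learner additionally does not know $S$.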
