Efficient Statistics With Unknown Truncation, Polynomial Time Algorithms, Beyond Gaussians

We study the estimation of distributional parameters when samples are observed only if they fall in some unknown set $S \subseteq \mathbb{R}^d$. Kontonis, Tzamos, and Zampetakis (FOCS'19) gave a $d^{\mathrm{poly}(1/\varepsilon)}$-time algorithm for finding $\varepsilon$-accurate parameters in the special case of Gaussian distributions with diagonal covariance matrix. Recently, Diakonikolas, Kane, Pittas, and Zarifis (COLT'24) showed that this exponential dependence on $1/\varepsilon$ is necessary even when $S$ belongs to some well-behaved classes. These works leave open the following problems, which we address in this work: Can we estimate the parameters of any Gaussian, or even extend beyond Gaussians? Can we design $\mathrm{poly}(d/\varepsilon)$-time algorithms when $S$ is a simple set such as a halfspace? We make progress on both questions with the following results:

1. Toward the first question, we give a $d^{\mathrm{poly}(\ell/\varepsilon)}$-time algorithm for any exponential family that satisfies some structural assumptions and any unknown set $S$ that is $\varepsilon$-approximable by degree-$\ell$ polynomials. This result has two important applications:
   1a) The first algorithm for estimating arbitrary Gaussian distributions from samples truncated to an unknown $S$; and
   1b) The first algorithm for linear regression with unknown truncation and Gaussian features.
2. To address the second question, we provide an algorithm with runtime $\mathrm{poly}(d/\varepsilon)$ that works for a set of exponential families (containing all Gaussians) when $S$ is a halfspace or an axis-aligned rectangle.

Along the way, we develop tools that may be of independent interest, including a reduction from PAC learning with positive and unlabeled samples to PAC learning with positive and negative samples that is robust to certain covariate shifts.
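To make the truncation setting described above concrete, the following is a minimal simulation sketch and not the paper's algorithm. It assumes a standard Gaussian in $d = 3$ dimensions and a hypothetical halfspace $S = \{x : x_1 \ge 0\}$; the function name `sample_truncated_gaussian` and all parameter values are illustrative choices. It shows only why truncation is a problem: the naive empirical mean of the surviving samples is biased away from the true mean.

```python
import numpy as np

def sample_truncated_gaussian(mean, cov, w, b, n, rng):
    """Rejection sampling: draw from N(mean, cov), keep only x with w.x >= b.

    The halfspace {x : w.x >= b} plays the role of the set S; an estimator
    in the unknown-truncation setting would see the output but not (w, b).
    """
    kept, total = [], 0
    while total < n:
        x = rng.multivariate_normal(mean, cov, size=4 * n)
        x = x[x @ w >= b]          # discard samples that fall outside S
        kept.append(x)
        total += len(x)
    return np.concatenate(kept)[:n]

rng = np.random.default_rng(0)
d = 3
true_mean = np.zeros(d)
true_cov = np.eye(d)

# Hypothetical truncation set (an assumption for illustration): x_1 >= 0.
w = np.array([1.0, 0.0, 0.0])
b = 0.0

samples = sample_truncated_gaussian(true_mean, true_cov, w, b, 20000, rng)
naive_mean = samples.mean(axis=0)

# Along the truncated coordinate the samples follow a half-normal law,
# whose mean is sqrt(2/pi) rather than the true value 0.
half_normal_mean = np.sqrt(2.0 / np.pi)
print("naive mean estimate:", naive_mean)
print("half-normal mean   :", half_normal_mean)
```

In this toy case the first coordinate of the naive estimate concentrates near $\sqrt{2/\pi}$ instead of the true value $0$, while the untruncated coordinates remain close to $0$. Correcting this bias without knowing $S$ is the estimation problem the abstract addresses.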
