Efficient Parameter Estimation of Truncated Boolean Product Distributions

We study the problem of estimating the parameters of a Boolean product distribution in $d$ dimensions when the samples are truncated by a set $S \subset \{0, 1\}^d$ that is accessible through a membership oracle. This is the first time that the computational and statistical complexity of learning from truncated samples has been considered in a discrete setting. We introduce a natural notion of fatness of the truncation set $S$, under which truncated samples reveal enough information about the true distribution. We show that if the truncation set is sufficiently fat, samples from the true distribution can be generated from truncated samples. A stunning consequence is that virtually any statistical task (e.g., learning in total variation distance, parameter estimation, uniformity or identity testing) that can be performed efficiently for Boolean product distributions can also be performed from truncated samples, with a small increase in sample complexity. We generalize our approach to ranking distributions over $d$ alternatives, where we show how fatness implies efficient parameter estimation of Mallows models from truncated samples. Exploring the limits of learning discrete models from truncated samples, we identify three natural conditions that are necessary for efficient identifiability: (i) the truncation set $S$ should be rich enough; (ii) $S$ should be accessible through membership queries; and (iii) the truncation by $S$ should leave enough randomness in all directions. By carefully adapting the Stochastic Gradient Descent approach of (Daskalakis et al., FOCS 2018), we show that these conditions are also sufficient for efficient learning of truncated Boolean product distributions.
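To make the sampling model concrete, the following is a minimal sketch of how truncated samples arise: a point is drawn from the product distribution and kept only if the membership oracle accepts it. This simulates the data-generating process, not the estimation algorithms of the paper. The names `sample_product`, `sample_truncated`, `empirical_mean`, and the `max_tries` cap are hypothetical, introduced only for illustration.

```python
import random
from typing import Callable, List, Sequence

Point = List[int]


def sample_product(p: Sequence[float], rng: random.Random) -> Point:
    """Draw one point of {0,1}^d; coordinate i is 1 with probability p[i], independently."""
    return [1 if rng.random() < p_i else 0 for p_i in p]


def sample_truncated(
    p: Sequence[float],
    in_S: Callable[[Point], bool],
    rng: random.Random,
    max_tries: int = 100_000,  # hypothetical safety cap, not part of the model
) -> Point:
    """Rejection sampling: redraw from the product distribution until the
    membership oracle in_S accepts, so the result is distributed as the
    product distribution conditioned on S."""
    for _ in range(max_tries):
        x = sample_product(p, rng)
        if in_S(x):
            return x
    raise RuntimeError("S appears to have negligible mass under p")


def empirical_mean(samples: Sequence[Point]) -> List[float]:
    """Coordinate-wise average of a non-empty list of samples."""
    n = len(samples)
    return [sum(x[i] for x in samples) / n for i in range(len(samples[0]))]
```

For example, with $d = 3$, all parameters equal to $0.5$, and $S = \{x : x_1 + x_2 + x_3 \ge 2\}$, the four points of $S$ are equally likely, so each coordinate of a truncated sample equals 1 with probability $3/4$ rather than $1/2$. The plain empirical mean of truncated samples is therefore a biased estimate of the true parameters, which is why truncation needs dedicated treatment.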
