We consider the problem of identifying, from statistics, a distribution of discrete random variables $X_1,\ldots,X_n$ that is a mixture of $k$ product distributions. The best previous sample complexity for $n \in O(k)$ was $(1/\zeta)^{O(k^2 \log k)}$ (under a mild separation assumption parameterized by $\zeta$). The best known lower bound was $\exp(\Omega(k))$. It is known that $n\geq 2k-1$ is necessary and sufficient for identification. We show, for any $n\geq 2k-1$, how to achieve sample complexity and run-time complexity $(1/\zeta)^{O(k)}$. We also extend the known lower bound of $e^{\Omega(k)}$ to match our upper bound across a broad range of $\zeta$. Our results are obtained by combining (a) a classic method for robust tensor decomposition and (b) a novel way of bounding the condition number of key matrices, called Hadamard extensions, by studying their action only on flattened rank-1 tensors.
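To make the object of study concrete, the following is a minimal sketch of a mixture of $k$ product distributions, not the identification algorithm itself. It assumes binary variables for simplicity (the abstract allows general discrete variables), and the names `mixture_pmf`, `sample`, `weights`, and `means` are hypothetical, chosen for illustration only. It computes the exact joint distribution, which is a weighted sum of $k$ rank-1 tensors, and compares one entry against an empirical estimate from samples.

```python
import itertools
import random


def mixture_pmf(weights, means):
    """Exact joint pmf of n binary variables under a mixture of product distributions.

    weights[j] is the probability of mixture component j (assumed to sum to 1).
    means[j][i] is P(X_i = 1 | component j).
    Returns a dict mapping each x in {0,1}^n to P(X = x).
    """
    n = len(means[0])
    pmf = {}
    for x in itertools.product((0, 1), repeat=n):
        total = 0.0
        for w, m in zip(weights, means):
            # Within a component the variables are independent,
            # so the probability of x is a product over coordinates.
            p = w
            for xi, mi in zip(x, m):
                p *= mi if xi == 1 else 1.0 - mi
            total += p
        pmf[x] = total
    return pmf


def sample(weights, means, rng):
    """Draw one observation: pick a hidden component, then each X_i independently."""
    j = rng.choices(range(len(weights)), weights=weights)[0]
    return tuple(1 if rng.random() < mi else 0 for mi in means[j])


# Hypothetical parameters: k = 2 components and n = 2k - 1 = 3 variables,
# the smallest n for which the abstract states identification is possible.
weights = [0.4, 0.6]
means = [[0.1, 0.2, 0.3], [0.7, 0.8, 0.9]]

pmf = mixture_pmf(weights, means)

rng = random.Random(0)
draws = [sample(weights, means, rng) for _ in range(20000)]
empirical = sum(1 for d in draws if d == (1, 1, 1)) / len(draws)

print("exact P(1,1,1):    ", pmf[(1, 1, 1)])
print("empirical P(1,1,1):", empirical)
```

The learner sees only the draws, never the hidden component index; the sample complexity bounds in the abstract concern how many such draws are needed to recover `weights` and `means` to a given accuracy.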