We study the problem of learning nonparametric distributions in a finite mixture, and establish tight bounds on the sample complexity of learning the component distributions in such models. Specifically, we are given i.i.d. samples from a pdf $f$ where $$ f=w_1f_1+w_2f_2, \quad w_1+w_2=1, \quad w_1,w_2>0 $$ and we are interested in learning each component $f_i$. Without any assumptions on the $f_i$, this problem is ill-posed. To make the components $f_i$ identifiable, we assume that each $f_i$ can be written as the convolution of a Gaussian with a compactly supported density $\nu_i$, where $\text{supp}(\nu_1)\cap \text{supp}(\nu_2)=\emptyset$. Our main result shows that $(\frac{1}{\varepsilon})^{\Omega(\log\log \frac{1}{\varepsilon})}$ samples are required to estimate each $f_i$. The proof relies on a quantitative Tauberian theorem that yields a fast rate of approximation with Gaussians, which may be of independent interest. To show that this bound is tight, we also propose an algorithm that uses $(\frac{1}{\varepsilon})^{O(\log\log \frac{1}{\varepsilon})}$ samples to estimate each $f_i$. Unlike existing approaches to learning latent variable models, which are based on moment matching and tensor methods, our proof involves a delicate analysis of an ill-conditioned linear system via orthogonal functions. Combining these bounds, we conclude that the optimal sample complexity of this problem lies strictly between polynomial and exponential, which is uncommon in learning theory.
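To make the model concrete, the following minimal sketch draws samples from one instance of it. The specific choices are assumptions for illustration only and do not come from the paper: each $\nu_i$ is taken to be uniform on an interval, the two intervals are disjoint, the Gaussian has standard deviation `sigma = 1`, and the name `sample_mixture` is hypothetical. The sketch only simulates the data; it is not the estimation algorithm described above.

```python
import random


def sample_mixture(n, w1=0.5, supp1=(-2.0, -1.0), supp2=(1.0, 2.0),
                   sigma=1.0, seed=0):
    """Draw n samples from f = w1*f1 + w2*f2, where f_i = N(0, sigma^2) * nu_i.

    Assumption (illustrative only): nu_i is uniform on supp_i, and the two
    intervals are disjoint, matching supp(nu_1) and supp(nu_2) being disjoint.
    Returns the samples and the latent component labels (0 or 1).
    """
    assert supp1[1] < supp2[0] or supp2[1] < supp1[0], "supports must be disjoint"
    rng = random.Random(seed)
    samples, labels = [], []
    for _ in range(n):
        # Pick the component with probability w1 (component 0) or w2 = 1 - w1.
        k = 0 if rng.random() < w1 else 1
        lo, hi = supp1 if k == 0 else supp2
        # Draw a latent location from nu_k, then add Gaussian noise;
        # the sum of the two is a draw from the convolution f_k.
        center = rng.uniform(lo, hi)
        samples.append(center + rng.gauss(0.0, sigma))
        labels.append(k)
    return samples, labels


if __name__ == "__main__":
    xs, zs = sample_mixture(5)
    for x, z in zip(xs, zs):
        print(f"component {z + 1}: x = {x:.3f}")
```

A learner sees only the samples, not the labels; the labels are returned here solely so that the simulation can be checked.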