We give a new algorithm for learning mixtures of $k$ Gaussians (with identity covariance in $\mathbb{R}^n$) to TV error $\varepsilon$, with quasi-polynomial ($n^{\operatorname{poly}\log\left(\frac{n+k}{\varepsilon}\right)}$) time and sample complexity, under a minimum weight assumption. Our results extend to continuous mixtures of Gaussians where the mixing distribution is supported on a union of $k$ balls of constant radius. In particular, this applies to the case of Gaussian convolutions of distributions on low-dimensional manifolds, or more generally sets with small covering number, for which no sub-exponential algorithm was previously known. Unlike previous approaches, most of which are algebraic in nature, our approach is analytic and relies on the framework of diffusion models. Diffusion models are a modern paradigm for generative modeling, which typically rely on learning the score function (gradient of the log-pdf) along a process transforming a pure noise distribution, in our case a Gaussian, to the data distribution. Despite their dazzling performance in tasks such as image generation, there are few end-to-end theoretical guarantees that they can efficiently learn nontrivial families of distributions; we give some of the first such guarantees. We proceed by deriving higher-order Gaussian noise sensitivity bounds for the score functions of a Gaussian mixture, showing that they can be learned inductively using piecewise polynomial regression (up to poly-logarithmic degree), and combine this with known convergence results for diffusion models.
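As an illustration of the central object above, the score function of an identity-covariance Gaussian mixture has a closed form: it is a responsibility-weighted average of the directions $\mu_i - x$. The following sketch (a numerical illustration, not the paper's algorithm) computes this score and checks it against a finite-difference gradient of the log-density.

```python
import numpy as np

def mixture_score(x, means, weights):
    """Score (gradient of the log-pdf) of sum_i w_i N(mu_i, I) at x.

    Closed form: sum_i r_i(x) * (mu_i - x), where r_i(x) are the
    posterior responsibilities of the components at x."""
    # log of w_i * N(x; mu_i, I), up to a shared additive constant
    # (the constant cancels in the softmax below)
    log_num = np.log(weights) - 0.5 * np.sum((x - means) ** 2, axis=1)
    r = np.exp(log_num - log_num.max())
    r /= r.sum()                      # responsibilities r_i(x)
    return r @ (means - x)            # sum_i r_i(x) * (mu_i - x)

def mixture_logpdf(x, means, weights):
    """Log-density up to an additive constant (enough for gradients)."""
    log_num = np.log(weights) - 0.5 * np.sum((x - means) ** 2, axis=1)
    m = log_num.max()
    return m + np.log(np.exp(log_num - m).sum())

# Sanity check against a central finite-difference gradient.
rng = np.random.default_rng(0)
n, k = 3, 4
means = rng.normal(size=(k, n))
weights = rng.dirichlet(np.ones(k))
x = rng.normal(size=n)

score = mixture_score(x, means, weights)
eps = 1e-6
fd = np.array([(mixture_logpdf(x + eps * e, means, weights)
                - mixture_logpdf(x - eps * e, means, weights)) / (2 * eps)
               for e in np.eye(n)])
assert np.allclose(score, fd, atol=1e-5)
```

Along the diffusion process, the data distribution is itself convolved with Gaussian noise, so the noised distribution remains a Gaussian mixture (with inflated covariance) and its score keeps this form; this is what makes the mixture case amenable to score-based analysis.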