In this work, we investigate Gaussian Mixture Models ({\it abbrv.} GMM) and the related problem of nonparametric maximum likelihood estimation ({\it abbrv.} NPMLE) from the perspective of statistical mechanics. In particular, we establish stability guarantees for the NPMLE procedure that extend well beyond the state of the art. Crucially, we obtain guarantees on the Kullback-Leibler (KL) divergence between NPMLE estimators and the ground truth, a type of result that is known to be challenging to obtain in the literature on this problem. Specifically, we provide high-probability upper bounds on the KL divergence between the NPMLE and the true density of the order of $\min\big\{\frac{(\log n)^{d+2}}{n} , \frac{\log n}{\sqrt n}\big\}$, which cover a wide range of scenarios for the comparative sizes of $n$ and $d$. We obtain similar guarantees for approximate solutions to the NPMLE problem, addressing realistic situations in which optimization algorithms must be stopped in finite time and therefore give access only to approximations of the true NPMLE. A technical cornerstone of our approach is an analysis of the function class complexity of logarithms of Gaussian mixture densities, which is able to handle their unboundedness and could be of wider interest. We also establish correspondences between stability phenomena in the NPMLE problem and the concepts of chaos and multiple valleys in random energy landscapes of statistical mechanics models. We believe that these correspondences may be useful for a wide variety of random optimization problems in statistics and machine learning, especially the connections to the technical ingredients of concentration phenomena and Langevin dynamics for these models.
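As a reading aid for the two regimes in the stated rate (a direct calculation from the bound as written, not an additional result, and assuming $\log$ denotes the natural logarithm and $n > e$, so that $\log\log n > 0$), the first term is the smaller of the two exactly when
\[
\frac{(\log n)^{d+2}}{n} \le \frac{\log n}{\sqrt n}
\iff (\log n)^{d+1} \le \sqrt n
\iff d + 1 \le \frac{\log n}{2\log\log n}.
\]
Under these assumptions, the rate $(\log n)^{d+2}/n$ is the active one when $d$ is small relative to $\log n / \log\log n$, and the rate $\log n/\sqrt n$ takes over otherwise.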