The Wasserstein distance between mixing measures has come to occupy a central place in the statistical analysis of mixture models. This work proposes a new canonical interpretation of this distance and provides tools to perform inference on it in topic models. We consider the general setting of an identifiable mixture model consisting of mixtures of distributions from a set $\mathcal{A}$ equipped with an arbitrary metric $d$, and show that the Wasserstein distance between mixing measures is uniquely characterized as the most discriminative convex extension of the metric $d$ to the set of mixtures of elements of $\mathcal{A}$. Although this distance has been widely used in the study of such models, it has so far lacked an axiomatic justification; our results establish it as a canonical choice. Specializing our results to topic models, we consider estimation and inference of this distance. Though upper bounds for its estimation have recently been established elsewhere, we prove the first minimax lower bounds for the estimation of the Wasserstein distance in topic models. We also establish fully data-driven inferential tools for the Wasserstein distance in the topic model context. Our results apply to potentially sparse mixtures of high-dimensional discrete probability distributions. These results allow us to obtain the first asymptotically valid confidence intervals for the Wasserstein distance in topic models.
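To make the central object concrete, the following is a minimal sketch of how the Wasserstein distance between two mixing measures could be computed when $\mathcal{A}$ is finite, as in a topic model with $K$ topics. It is an illustration under stated assumptions and not the estimator or inferential procedure of this work: the function name `mixing_wasserstein`, the mixing weight vectors `alpha` and `beta`, and the matrix `D` of pairwise distances $d(A_k, A_l)$ are all hypothetical names, and the optimal transport problem is solved as a generic linear program with SciPy.

```python
import numpy as np
from scipy.optimize import linprog


def mixing_wasserstein(alpha, beta, D):
    """Wasserstein distance between two mixing measures on K atoms.

    alpha, beta : mixing weights (nonnegative, each summing to 1).
    D           : K x K matrix with D[k, l] = d(A_k, A_l) (assumed to be a metric).
    """
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    D = np.asarray(D, dtype=float)
    K = alpha.size

    # The coupling pi is flattened row-major: pi[k, l] sits at index k * K + l.
    # Row marginals: sum_l pi[k, l] = alpha[k].
    row_constraints = np.kron(np.eye(K), np.ones((1, K)))
    # Column marginals: sum_k pi[k, l] = beta[l].
    col_constraints = np.kron(np.ones((1, K)), np.eye(K))

    result = linprog(
        c=D.ravel(),
        A_eq=np.vstack([row_constraints, col_constraints]),
        b_eq=np.concatenate([alpha, beta]),
        bounds=(0, None),
        method="highs",
    )
    if not result.success:
        raise RuntimeError(result.message)
    return result.fun


# Hypothetical ground metric on three atoms placed at 0, 1, 2 on a line.
D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])

# On point masses the distance reduces to d itself (the "extension" property).
w_atoms = mixing_wasserstein([1, 0, 0], [0, 0, 1], D)

# On genuine mixtures it is the minimal transport cost between the weights.
w_mix = mixing_wasserstein([0.5, 0.5, 0.0], [0.0, 0.5, 0.5], D)
```

In this toy example the distance between the two point masses equals $d(A_1, A_3) = 2$, which illustrates the sense in which the Wasserstein distance extends $d$ from $\mathcal{A}$ to mixtures of its elements.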