Most popular dimension reduction (DR) methods, such as t-SNE and UMAP, are based on minimizing a cost between input and latent pairwise similarities. Though widely used, these approaches lack the clear probabilistic foundations needed to fully understand their properties and limitations. To that end, we introduce a unifying statistical framework based on the coupling of hidden graphs using cross entropy. These graphs induce a Markov random field dependency structure among the observations in both input and latent spaces. We show that existing pairwise similarity DR methods can be retrieved from our framework with particular choices of priors for the graphs. Moreover, this reveals that these methods suffer from a statistical deficiency that explains their poor performance in preserving coarse-grained dependencies. We leverage and extend our model to address this issue, and draw new links with Laplacian eigenmaps and PCA.
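As a rough illustration of what "minimizing a cost between input and latent pairwise similarities" can look like in practice, the following is a minimal sketch of a cross-entropy cost between two similarity matrices. It is not the framework introduced here: the function names, the fixed Gaussian bandwidth `sigma`, and the row-wise (SNE-style) normalization are all assumptions made for illustration. In particular, t-SNE calibrates a per-point bandwidth from a perplexity target and symmetrizes its similarities, which this sketch omits.

```python
import numpy as np

def pairwise_sq_dists(A):
    # Squared Euclidean distances between all rows of A.
    sq = np.sum(A * A, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.maximum(D, 0.0)  # clip tiny negatives from round-off

def gaussian(D, sigma=1.0):
    # Gaussian kernel on squared distances (fixed bandwidth: an assumption).
    return np.exp(-D / (2.0 * sigma**2))

def student_t(D):
    # Heavy-tailed kernel of the kind used in the latent space by t-SNE.
    return 1.0 / (1.0 + D)

def row_normalized_similarities(A, kernel):
    # Apply the kernel, drop self-similarities, and normalize each row
    # so that it sums to one.
    K = kernel(pairwise_sq_dists(A))
    np.fill_diagonal(K, 0.0)
    return K / K.sum(axis=1, keepdims=True)

def cross_entropy_cost(X, Z, eps=1e-12):
    # Cross entropy between input similarities P and latent similarities Q.
    P = row_normalized_similarities(X, gaussian)
    Q = row_normalized_similarities(Z, student_t)
    return float(-np.sum(P * np.log(Q + eps)))
```

A DR method of this family would treat the latent coordinates `Z` as free parameters and minimize `cross_entropy_cost(X, Z)` over them, typically by gradient descent.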