The neighbor embedding methods $t$-SNE and UMAP are the de facto standard for visualizing high-dimensional datasets. Motivated by entirely different viewpoints, their loss functions appear to be unrelated. In practice, they yield strongly differing embeddings and can suggest conflicting interpretations of the same data. The fundamental reasons for this and, more generally, the exact relationship between $t$-SNE and UMAP have remained unclear. In this work, we uncover their conceptual connection via a new insight into contrastive learning methods. Noise-contrastive estimation can be used to optimize $t$-SNE, while UMAP relies on negative sampling, another contrastive method. We find the precise relationship between these two contrastive methods and provide a mathematical characterization of the distortion introduced by negative sampling. Visually, this distortion results in UMAP generating more compact embeddings with tighter clusters compared to $t$-SNE. We exploit this new conceptual connection to propose and implement a generalization of negative sampling, allowing us to interpolate between (and even extrapolate beyond) $t$-SNE and UMAP and their respective embeddings. Moving along this spectrum of embeddings leads to a trade-off between discrete / local and continuous / global structures, mitigating the risk of over-interpreting ostensible features of any single embedding. We provide a PyTorch implementation.
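To make the contrast between the two contrastive objectives more concrete, the following is a minimal sketch, not the implementation referred to above. It assumes a Cauchy similarity kernel $q(d) = 1/(1+d^2)$ in the embedding and writes the probability that a pair is a true neighbor (rather than a noise sample) as $q/(q+c)$. The constant `c` is a hypothetical stand-in for the normalization term: under this assumption, `c = 1` gives a negative-sampling-style loss, while other values move along a spectrum of losses. All function and parameter names are illustrative.

```python
import math


def cauchy_similarity(dist):
    """Heavy-tailed similarity of two embedding points at distance `dist` (assumed kernel)."""
    return 1.0 / (1.0 + dist * dist)


def posterior(dist, c=1.0):
    """Assumed probability that a pair at distance `dist` is a true neighbor pair.

    `c` is a hypothetical constant standing in for the normalization term;
    c = 1 corresponds to the negative-sampling-style form q / (q + 1).
    """
    q = cauchy_similarity(dist)
    return q / (q + c)


def contrastive_loss(pos_dist, neg_dists, c=1.0):
    """Loss for one neighbor pair plus a list of sampled noise pairs."""
    # Attractive term: neighbor pairs should be classified as "true".
    attract = -math.log(posterior(pos_dist, c))
    # Repulsive term: sampled noise pairs should be classified as "noise".
    repel = -sum(math.log(1.0 - posterior(d, c)) for d in neg_dists)
    return attract + repel


if __name__ == "__main__":
    for c in (0.25, 1.0, 4.0):
        loss = contrastive_loss(pos_dist=1.0, neg_dists=[0.5, 2.0], c=c)
        print(f"c={c}: loss={loss:.4f}")
```

In this sketch, increasing `c` raises the penalty on distant neighbor pairs and lowers the penalty on nearby noise pairs, so attraction gains weight relative to repulsion. This is in line with the more compact embeddings described above, although the exact parametrization used in the paper may differ.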