Finding meaningful distances between high-dimensional data samples is an important scientific task. To this end, we propose a new tree-Wasserstein distance (TWD) for high-dimensional data with two key aspects. First, our TWD is specifically designed for data with a latent feature hierarchy, i.e., the features lie in a hierarchical space, in contrast to the usual focus on embedding samples in hyperbolic space. Second, while TWD is conventionally used to speed up the computation of the Wasserstein distance, we use its inherent tree as a means to learn the latent feature hierarchy. The key idea of our method is to embed the features into a multi-scale hyperbolic space using diffusion geometry and then to decode a tree from this embedding with a new method that builds on analogies between hyperbolic embeddings and trees. We show that our TWD, computed from data observations, provably recovers the TWD defined with the latent feature hierarchy, and that its computation is efficient and scalable. We showcase the usefulness of the proposed TWD in applications to word-document and single-cell RNA-sequencing datasets, demonstrating its advantages over existing TWDs and methods based on pre-trained models.
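For readers unfamiliar with TWD, the following minimal sketch shows the standard closed form of the Wasserstein distance on a tree: the sum, over edges, of the edge weight times the absolute difference in mass below that edge. It does not implement the tree construction proposed here (hyperbolic embedding via diffusion geometry and tree decoding); the tree is specified by hand, and the function name `tree_wasserstein` and its array-based tree encoding are assumptions made for illustration only.

```python
import numpy as np


def tree_wasserstein(parent, edge_weight, mu, nu):
    """Wasserstein distance between two histograms on the nodes of a rooted tree.

    Hypothetical encoding (an assumption of this sketch):
      parent[v]      -- index of the parent of node v; the root is node 0 with parent -1
      edge_weight[v] -- weight of the edge (v, parent[v]); ignored for the root
    Nodes must be ordered so that parent[v] < v for every non-root node.
    mu and nu are non-negative mass vectors over the nodes with equal total mass.
    """
    # Net mass at each node; after accumulation, subtree[v] holds the
    # difference in mass between mu and nu over the subtree rooted at v.
    subtree = np.asarray(mu, dtype=float) - np.asarray(nu, dtype=float)
    total = 0.0
    # Visit children before parents by walking the indices in reverse.
    for v in range(len(parent) - 1, 0, -1):
        # All excess mass below edge (v, parent[v]) must cross that edge.
        total += edge_weight[v] * abs(subtree[v])
        subtree[parent[v]] += subtree[v]
    return total


# Toy feature hierarchy: root 0, internal nodes 1 and 2, leaves 3, 4 (under 1)
# and 5, 6 (under 2). All edges have unit weight.
parent = [-1, 0, 0, 1, 1, 2, 2]
edge_weight = [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

sample_a = [0, 0, 0, 1, 0, 0, 0]  # all mass on leaf 3
sample_b = [0, 0, 0, 0, 1, 0, 0]  # all mass on leaf 4 (sibling of leaf 3)
sample_c = [0, 0, 0, 0, 0, 1, 0]  # all mass on leaf 5 (different branch)

d_ab = tree_wasserstein(parent, edge_weight, sample_a, sample_b)
d_ac = tree_wasserstein(parent, edge_weight, sample_a, sample_c)
print(d_ab, d_ac)
```

In this toy example, samples concentrated on sibling features are closer than samples concentrated on features in different branches, which is the kind of hierarchy-aware behavior a TWD is meant to capture. The single bottom-up pass runs in time linear in the number of tree nodes.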