Despite much attention, the comparison of reduced-dimension representations of high-dimensional data remains a challenging problem in multiple fields, especially when representations remain high-dimensional compared to sample size. We offer a framework for evaluating the topological similarity of high-dimensional representations of very high-dimensional data, a regime where topological structure is more likely captured in the distribution of topological "noise" than in a few prominent generators. Treating each representational map as a metric embedding, we compute the Vietoris-Rips persistence of its image. We then use the topological bootstrap to analyze the re-sampling stability of each representation, assigning a "prevalence score" to each nontrivial basis element of its persistence module. Finally, we compare the persistent homology of representations using a prevalence-weighted variant of the Wasserstein distance. Notably, our method can compare representations derived from different samples of the same distribution and, in particular, is not restricted to comparisons of graphs on the same vertex set. In addition, representations need not lie in the same metric space. We apply this analysis to a cross-sectional sample of representations of functional neuroimaging data in a large cohort and hierarchically cluster the representations under the prevalence-weighted Wasserstein distance. We find that the ambient dimension of a representation is a stronger predictor of the number and stability of topological features than its decomposition rank. Our findings suggest that important topological information lies in repeatable, low-persistence homology generators, whose distributions capture important and interpretable differences between high-dimensional data representations.
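As a rough illustration of the first step only (Vietoris-Rips persistence of a point cloud), the sketch below computes the degree-0 persistence of a small set of points in pure Python. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `rips_h0_persistence` is hypothetical, only connected components (H0) are handled, an edge is assumed to enter the filtration at a scale equal to its Euclidean length, and neither the topological bootstrap nor the prevalence-weighted Wasserstein distance is implemented.

```python
import itertools
import math


def rips_h0_persistence(points):
    """Death scales of H0 classes in a Vietoris-Rips filtration.

    Hypothetical helper for illustration. Every point is born at scale 0.
    A connected component dies at the length of the edge that first merges
    it into another component. The one component that never dies is omitted.
    Assumption: an edge appears at scale equal to its Euclidean length
    (some conventions use half that length).
    """
    n = len(points)
    parent = list(range(n))  # union-find forest over point indices

    def find(i):
        # Walk to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, processed in order of increasing length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )

    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # This edge merges two components, so one H0 class dies here.
            parent[root_i] = root_j
            deaths.append(length)
    return deaths
```

For example, `rips_h0_persistence([(0, 0), (1, 0), (5, 0)])` returns `[1.0, 4.0]`: the two nearby points merge at scale 1, and the distant point joins them at scale 4. Higher-degree homology, which the framework also relies on, would in practice require a dedicated persistent homology library.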