Persistence diagrams (PDs) play a central role in topological data analysis and are used in an ever-increasing variety of applications. Comparing PD data requires computing distances among large sets of PDs, using metrics that are accurate, theoretically sound, and fast to compute. Such metrics are lacking, especially for denser multi-dimensional PDs. On the one hand, Wasserstein-type distances offer high accuracy and theoretical guarantees but incur high computational cost. On the other hand, distances between vectorizations such as Persistence Statistics (PSs) are cheaper to compute but lack accuracy guarantees and, in general, are not guaranteed to distinguish PDs (i.e., two different PDs may have equal PS vectors). In this work we introduce a class of pseudodistances called Extended Topological Pseudodistances (ETDs), which have tunable complexity: at the high-complexity extreme they can approximate Sliced and classical Wasserstein distances, while at the low-complexity extreme they are computationally lighter and close to Persistence Statistics, thus allowing users to interpolate between the two kinds of metrics. We build theoretical comparisons showing how our new distances fit at an intermediate level between persistence vectorizations and Wasserstein distances. We also verify experimentally that ETDs outperform PSs in terms of accuracy and outperform Wasserstein and Sliced Wasserstein distances in terms of computational complexity.
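The claim that two different PDs may share the same PS vector can be illustrated with a small sketch. It makes assumptions that are not taken from this work: `persistence_stats` is a simplified, hypothetical PS vector (mean, standard deviation, minimum, and maximum of births, deaths, and lifetimes), and `sliced_wasserstein` is a plain NumPy approximation of the Sliced Wasserstein distance that augments each diagram with the diagonal projections of the other and averages over evenly spaced directions. Neither function is the ETD construction introduced here.

```python
import numpy as np

def persistence_stats(pd):
    """Simplified, hypothetical PS vector: summary statistics of
    births, deaths, and lifetimes (an assumption for illustration)."""
    pd = np.asarray(pd, dtype=float)
    births, deaths = pd[:, 0], pd[:, 1]
    lifetimes = deaths - births
    feats = []
    for values in (births, deaths, lifetimes):
        feats += [values.mean(), values.std(), values.min(), values.max()]
    return np.array(feats)

def sliced_wasserstein(pd1, pd2, n_dirs=50):
    """Approximate Sliced Wasserstein distance between two diagrams."""
    pd1 = np.asarray(pd1, dtype=float)
    pd2 = np.asarray(pd2, dtype=float)
    # Orthogonal projection of each point (b, d) onto the diagonal.
    diag1 = np.repeat(pd1.mean(axis=1, keepdims=True), 2, axis=1)
    diag2 = np.repeat(pd2.mean(axis=1, keepdims=True), 2, axis=1)
    # Augment each diagram so both point sets have equal size.
    a = np.vstack([pd1, diag2])
    b = np.vstack([pd2, diag1])
    thetas = np.linspace(-np.pi / 2, np.pi / 2, n_dirs, endpoint=False)
    total = 0.0
    for theta in thetas:
        direction = np.array([np.cos(theta), np.sin(theta)])
        # 1-D optimal transport cost: compare sorted projections.
        total += np.abs(np.sort(a @ direction) - np.sort(b @ direction)).sum()
    return total / n_dirs

# Two different diagrams with identical multisets of births, deaths,
# and lifetimes, hence identical simplified PS vectors.
pd_a = [(0, 3), (1, 5), (2, 4)]
pd_b = [(0, 4), (1, 3), (2, 5)]

print(np.allclose(persistence_stats(pd_a), persistence_stats(pd_b)))
print(sliced_wasserstein(pd_a, pd_b))
```

In this toy case the simplified PS vectors coincide, so any distance between them is zero, whereas the sliced approximation is strictly positive. Richer PS vectors (for example, ones that include midpoint statistics) would separate this particular pair, but the same kind of collision can occur for any fixed finite set of summary statistics.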