Vector Summaries of Persistence Diagrams for Permutation-based Hypothesis Testing

Over the past decade, topological data analysis (TDA) has risen to prominence as a way to describe the shape of data. In recent years, there has been increasing interest in developing statistical methods for TDA, in particular hypothesis-testing procedures. From the statistical perspective, persistence diagrams (the central multi-scale topological descriptors of data provided by TDA) are viewed as random observations sampled from some population or process. In this context, one of the earliest works on hypothesis testing focuses on a two-group permutation-based approach in which the associated loss function is defined in terms of within-group pairwise bottleneck or Wasserstein distances between persistence diagrams (Robinson and Turner, 2017). However, when the persistence diagrams are large in size and number, this permutation test becomes computationally expensive to apply. To address this limitation, we instead define the loss function in terms of pairwise distances between vectorized functional summaries of persistence diagrams. In the present work, we explore the utility of the Betti function in this regard, which is one of the simplest functional summaries of persistence diagrams. We introduce an alternative vectorization method for the Betti function based on integration and prove stability results with respect to the Wasserstein distance. Moreover, we propose a new technique for shuffling group labels that increases the power of the test. Through several experimental studies on both synthetic and real data, we show that the vectorized Betti function leads to competitive results compared to the baseline permutation test based on Wasserstein distances.
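As a rough illustration of the pipeline described above, the following Python sketch vectorizes a persistence diagram by averaging its Betti function over the cells of a grid, then runs a two-group permutation test on a within-group pairwise-distance loss. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`betti_vector`, `within_group_loss`, `permutation_test`), the choice of the L1 distance between vectors, the normalization of the loss, and the plain uniform shuffling of labels are all illustrative choices. In particular, the new label-shuffling technique proposed in this work is not reproduced here.

```python
import numpy as np

def betti_vector(diagram, grid):
    """Average of the Betti function over each grid cell [t_i, t_{i+1}].

    The Betti function counts the points (b, d) with b <= t < d, so its
    integral over a cell is the total overlap of the intervals [b, d) with
    that cell. Dividing by the cell width is an assumed normalization.
    """
    diagram = np.asarray(diagram, dtype=float).reshape(-1, 2)
    grid = np.asarray(grid, dtype=float)
    births, deaths = diagram[:, 0], diagram[:, 1]
    lo, hi = grid[:-1], grid[1:]
    # Overlap length of each [b, d) with each cell; shape (n_points, n_cells).
    overlap = np.minimum(deaths[:, None], hi[None, :]) - np.maximum(births[:, None], lo[None, :])
    overlap = np.clip(overlap, 0.0, None)
    return overlap.sum(axis=0) / (hi - lo)

def within_group_loss(dist, labels):
    """Sum over groups of the normalized within-group pairwise distances."""
    loss = 0.0
    for g in np.unique(labels):
        idx = np.flatnonzero(labels == g)
        n = len(idx)  # each group is assumed to have at least two members
        loss += dist[np.ix_(idx, idx)].sum() / (2 * n * (n - 1))
    return loss

def permutation_test(vectors, labels, n_perm=999, seed=0):
    """Permutation p-value; a small loss means the groups are tightly clustered."""
    rng = np.random.default_rng(seed)
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    # Pairwise L1 distances between vectorized summaries (computed once).
    dist = np.abs(vectors[:, None, :] - vectors[None, :, :]).sum(axis=2)
    observed = within_group_loss(dist, labels)
    # Count shuffles whose loss is at least as small as the observed one.
    count = sum(
        within_group_loss(dist, rng.permutation(labels)) <= observed
        for _ in range(n_perm)
    )
    return (count + 1) / (n_perm + 1)
```

Because the distance matrix between the fixed-length vectors is computed once and only re-indexed for each shuffle, the cost per permutation does not depend on the size of the diagrams, which is the computational advantage over recomputing or storing Wasserstein distances between large diagrams.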
