The distance function to a compact set plays a central role in topological data analysis. In particular, the sublevel sets of the distance function are used to compute persistent homology, a backbone of the topological data analysis pipeline. Although persistent homology is stable under perturbations in the Hausdorff distance, it is highly sensitive to outliers. In this work, we develop a framework for statistical inference for persistent homology in the presence of outliers. Drawing inspiration from recent developments in robust statistics, we propose a \textit{median-of-means} variant of the distance function (\textsf{MoM Dist}) and establish its statistical properties. In particular, we show that, even in the presence of outliers, the sublevel filtrations and weighted filtrations induced by \textsf{MoM Dist} are both consistent estimators of their true underlying population counterparts and exhibit near minimax-optimal performance in adversarial settings. Finally, we demonstrate the advantages of the proposed methodology through simulations and applications.
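To make the median-of-means idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that \textsf{MoM Dist} at a query point is the median, over disjoint blocks of the sample, of the distance from the query point to each block. The function name `mom_dist`, the random partition into blocks, and the toy data are illustrative assumptions not taken from the text.

```python
import numpy as np


def mom_dist(query, sample, n_blocks, rng=None):
    """Sketch of a median-of-means distance (hypothetical helper).

    Assumption: the sample is split into `n_blocks` disjoint blocks, and the
    value at each query point is the median over blocks of the Euclidean
    distance from the query point to the nearest sample point in that block.
    """
    rng = np.random.default_rng(rng)
    query = np.atleast_2d(np.asarray(query, dtype=float))
    sample = np.asarray(sample, dtype=float)

    # Randomly partition the sample indices into disjoint blocks.
    blocks = np.array_split(rng.permutation(len(sample)), n_blocks)

    # per_block[q, i] = distance from query point i to block q.
    per_block = np.stack([
        np.linalg.norm(query[:, None, :] - sample[idx][None, :, :], axis=2).min(axis=1)
        for idx in blocks
    ])
    return np.median(per_block, axis=0)


# Toy data: 200 points on the unit circle plus 5 outliers near the origin.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=200)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
outliers = rng.normal(scale=0.01, size=(5, 2))
points = np.vstack([circle, outliers])

origin = np.zeros((1, 2))
plain = mom_dist(origin, points, n_blocks=1, rng=0)[0]   # ordinary distance function
robust = mom_dist(origin, points, n_blocks=11, rng=0)[0]  # more blocks than 2 x outliers
```

With a single block the function reduces to the ordinary distance function, which the outliers pull toward zero at the origin. With 11 blocks and only 5 outliers, a majority of blocks are outlier-free, so the median reflects the distance to the circle.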