The Mapper algorithm is a visualization technique in topological data analysis (TDA) that outputs a graph reflecting the structure of a given dataset. However, it requires tuning several parameters in order to generate a ``nice'' Mapper graph. This paper focuses on selecting the cover parameter. We present an algorithm that optimizes the cover of a Mapper graph by repeatedly splitting cover elements according to a statistical test for normality. Our algorithm is based on $G$-means clustering, which searches for the optimal number of clusters in $k$-means by iteratively applying the Anderson-Darling test. Our splitting procedure employs a Gaussian mixture model to choose the cover according to the distribution of the given data. Experiments on synthetic and real-world datasets demonstrate that our algorithm generates covers for which the Mapper graphs retain the essence of the datasets, while also running fast.
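The splitting idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a one-dimensional lens, uses only the Python standard library, and every name in it (`anderson_darling`, `fit_two_gaussians`, `split_cover`) is hypothetical. The critical value `1.092` (roughly the 1% level in commonly tabulated values), the variance-weighted split point, the `overlap` padding rule, and the `min_points` cutoff are all assumptions chosen for illustration.

```python
import math
import random
import statistics


def anderson_darling(xs):
    """Anderson-Darling A^2 statistic for normality, with the
    (1 + 4/n - 25/n^2) small-sample correction used in G-means."""
    n = len(xs)
    mu, sd = statistics.fmean(xs), statistics.stdev(xs)
    if sd == 0:
        return 0.0  # constant data: nothing to split
    # Normal CDF of each sorted point, clamped away from 0 and 1 for the logs.
    cdf = [min(max(0.5 * (1 + math.erf((x - mu) / (sd * math.sqrt(2)))), 1e-12),
               1 - 1e-12) for x in sorted(xs)]
    s = sum((2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
            for i in range(n))
    return (-n - s / n) * (1 + 4 / n - 25 / n ** 2)


def fit_two_gaussians(xs, iters=50):
    """Plain EM for a 1-D two-component Gaussian mixture.
    Returns [(weight, mean, sd), (weight, mean, sd)] sorted by mean."""
    ordered = sorted(xs)
    half = len(ordered) // 2
    # Initialize each component from the lower / upper half of the data.
    comps = [(0.5, statistics.fmean(part), statistics.pstdev(part) or 1e-6)
             for part in (ordered[:half], ordered[half:])]
    for _ in range(iters):
        # E-step: responsibility of the first component for each point.
        resp = []
        for x in xs:
            p = [w * math.exp(-0.5 * ((x - m) / s) ** 2) / s for w, m, s in comps]
            total = (p[0] + p[1]) or 1e-300
            resp.append(p[0] / total)
        # M-step: re-estimate weight, mean, and sd of both components.
        updated = []
        for r in (resp, [1 - v for v in resp]):
            nk = sum(r) or 1e-12
            m = sum(ri * x for ri, x in zip(r, xs)) / nk
            var = sum(ri * (x - m) ** 2 for ri, x in zip(r, xs)) / nk
            updated.append((nk / len(xs), m, max(math.sqrt(var), 1e-6)))
        comps = updated
    return sorted(comps, key=lambda c: c[1])


def split_cover(values, critical=1.092, overlap=0.1, min_points=20):
    """Split [min, max] of the lens values until every interval passes the
    normality test (or cannot be split further). Returns sorted intervals."""
    cover = []
    stack = [(min(values), max(values))]
    while stack:
        a, b = stack.pop()  # depth-first order
        inside = [v for v in values if a <= v <= b]
        if len(inside) < min_points or anderson_darling(inside) <= critical:
            cover.append((a, b))
            continue
        (_, m1, s1), (_, m2, s2) = fit_two_gaussians(inside)
        # Assumed split rule: the point equally many sds from both means.
        c = (m1 * s2 + m2 * s1) / (s1 + s2)
        pad = overlap * min(s1, s2)  # assumed overlap rule
        if not (a < c - pad and c + pad < b):
            cover.append((a, b))  # split would not shrink the interval
            continue
        stack += [(a, c + pad), (c - pad, b)]
    return sorted(cover)


if __name__ == "__main__":
    random.seed(0)
    lens = ([random.gauss(0, 1) for _ in range(300)]
            + [random.gauss(10, 1) for _ in range(300)])
    for a, b in split_cover(lens):
        print(f"[{a:.2f}, {b:.2f}]")
```

On a clearly bimodal lens such as the one in the usage example, the first test rejects normality and the initial interval is split near the midpoint between the two modes; each resulting interval is then tested again. A library-based variant could swap in `scipy.stats.anderson` and `sklearn.mixture.GaussianMixture` for the two helper functions.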