The Mapper algorithm is a visualization technique in topological data analysis (TDA) that outputs a graph reflecting the structure of a given dataset. The Mapper algorithm requires tuning several parameters to generate a "nice" Mapper graph. This paper focuses on selecting the cover parameter. We present an algorithm that optimizes the cover of a Mapper graph by repeatedly splitting the cover according to a statistical test for normality. Our algorithm is based on $G$-means clustering, which searches for the optimal number of clusters in $k$-means by iteratively applying the Anderson-Darling test. Our splitting procedure employs a Gaussian mixture model to choose the cover carefully based on the distribution of the given data. Experiments on synthetic and real-world datasets demonstrate that our algorithm generates covers for which the Mapper graphs retain the essence of the datasets.
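The split-and-test idea described above can be illustrated with a minimal sketch. It is not the paper's procedure: it assumes a one-dimensional lens (filter) function, and the function names, the cut rule, the `overlap`, `min_points`, and `max_splits` parameters, and the hand-written EM fit are all hypothetical choices made for illustration. The small-sample correction and the critical value 1.8692 are the ones commonly quoted for $G$-means and should be treated as assumptions.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def anderson_darling(xs):
    """Corrected Anderson-Darling statistic for normality (mean and std estimated from xs)."""
    n = len(xs)
    mu, sd = fmean(xs), stdev(xs)
    std_normal = NormalDist()
    # Clip CDF values away from 0 and 1 so the logarithms stay finite.
    u = [min(max(std_normal.cdf((x - mu) / sd), 1e-12), 1 - 1e-12) for x in sorted(xs)]
    s = sum((2 * i + 1) * (math.log(u[i]) + math.log(1 - u[n - 1 - i])) for i in range(n))
    a2 = -n - s / n
    # Small-sample correction as commonly quoted for G-means (assumption).
    return a2 * (1 + 4 / n - 25 / n**2)


def fit_two_gaussians(xs, iters=60):
    """Fit a 2-component 1-D Gaussian mixture by plain EM; returns [(mean, std), ...] sorted by mean."""
    s = sorted(xs)
    n = len(s)
    mu = [s[n // 4], s[3 * n // 4]]          # initialise the means at the quartiles
    floor = 1e-6 * (s[-1] - s[0]) + 1e-12    # keeps the standard deviations strictly positive
    sd = [max(stdev(s), floor)] * 2
    w = [0.5, 0.5]
    for _ in range(iters):
        comps = [NormalDist(mu[k], sd[k]) for k in range(2)]
        # E-step: responsibility of component 0 for each point.
        r0 = []
        for x in s:
            p0, p1 = w[0] * comps[0].pdf(x), w[1] * comps[1].pdf(x)
            r0.append(p0 / (p0 + p1) if p0 + p1 > 0 else 0.5)
        # M-step: update weight, mean, and std of each component.
        for k, r in enumerate((r0, [1 - v for v in r0])):
            nk = max(sum(r), 1e-12)
            mu[k] = sum(ri * x for ri, x in zip(r, s)) / nk
            var = sum(ri * (x - mu[k]) ** 2 for ri, x in zip(r, s)) / nk
            sd[k] = max(math.sqrt(var), floor)
            w[k] = nk / n
    return sorted(zip(mu, sd))


def refine_cover(values, critical=1.8692, min_points=20, overlap=0.1, max_splits=10):
    """Split the interval [min, max] of the lens values until each piece looks Gaussian."""
    pending = [(min(values), max(values))]
    cover, splits = [], 0
    while pending:
        a, b = pending.pop()
        pts = [v for v in values if a <= v <= b]
        splittable = splits < max_splits and len(pts) >= min_points and stdev(pts) > 0
        if not splittable or anderson_darling(pts) <= critical:
            cover.append((a, b))             # normality not rejected: keep the interval
            continue
        (m1, s1), (m2, s2) = fit_two_gaussians(pts)
        # Hypothetical cut rule: the point equally many standard deviations from both means.
        cut = (m1 * s2 + m2 * s1) / (s1 + s2)
        pad = overlap * (b - a) / 2          # half-width of the overlap between the two children
        if cut - pad <= a or cut + pad >= b:
            cover.append((a, b))             # a split would not shrink the interval
            continue
        pending += [(a, cut + pad), (cut - pad, b)]
        splits += 1
    return sorted(cover)


if __name__ == "__main__":
    random.seed(0)
    lens = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(10, 1) for _ in range(200)]
    for a, b in refine_cover(lens):
        print(f"[{a:.2f}, {b:.2f}]")
```

In this sketch, an interval is kept when the Anderson-Darling test does not reject normality of the lens values it contains, and it is otherwise replaced by two overlapping intervals placed around a cut suggested by the fitted two-component mixture. A stricter `critical` value yields fewer, wider intervals.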