The Mapper algorithm is a visualization technique in topological data analysis (TDA) that outputs a graph reflecting the structure of a given dataset. However, Mapper requires tuning several parameters in order to generate a ``nice'' Mapper graph. This paper focuses on selecting the cover parameter. We present an algorithm that optimizes the cover of a Mapper graph by repeatedly splitting cover elements according to a statistical test for normality. Our algorithm is based on $G$-means clustering, which searches for the optimal number of clusters in $k$-means by iteratively applying the Anderson-Darling test. Our splitting procedure employs a Gaussian mixture model to choose the cover according to the distribution of the given data. Experiments on synthetic and real-world datasets demonstrate that our algorithm generates covers for which the Mapper graphs retain the essence of the datasets, while also running quickly.
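The splitting idea can be illustrated with a minimal sketch, which is not the paper's implementation. It assumes one-dimensional filter values, and every name in it (`anderson_darling`, `fit_two_gaussians`, `adaptive_cover`) is hypothetical. The parameters `threshold`, `overlap`, `min_points`, and `max_depth` are likewise assumptions, as is the rule that places the cut an equal number of standard deviations from each mixture mean. An interval is split whenever its points fail the Anderson-Darling test, and the two-component Gaussian mixture decides where to split.

```python
import math
from statistics import NormalDist, fmean, stdev


def anderson_darling(values):
    """Anderson-Darling normality statistic with mean and variance estimated."""
    n = len(values)
    mu, sigma = fmean(values), stdev(values)
    if sigma == 0.0:
        return 0.0  # constant data cannot be split, so treat it as accepted
    std_normal = NormalDist()
    # Clamp the CDF values so that both logarithms stay finite.
    cdf = [min(max(std_normal.cdf((v - mu) / sigma), 1e-12), 1 - 1e-12)
           for v in sorted(values)]
    total = sum((2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
                for i in range(n))
    a2 = -n - total / n
    return a2 * (1 + 0.75 / n + 2.25 / n ** 2)  # small-sample correction


def fit_two_gaussians(values, iterations=50):
    """Fit a 1-D two-component Gaussian mixture with plain EM."""
    data = sorted(values)
    n = len(data)
    means = [data[n // 4], data[(3 * n) // 4]]  # start at the quartiles
    stds = [max(stdev(data) / 2, 1e-6)] * 2
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of component 0 for each point.
        resp = []
        for v in data:
            p0 = weights[0] * NormalDist(means[0], stds[0]).pdf(v)
            p1 = weights[1] * NormalDist(means[1], stds[1]).pdf(v)
            resp.append(p0 / (p0 + p1) if p0 + p1 > 0 else 0.5)
        # M-step: update weight, mean, and standard deviation per component.
        for k, r in enumerate((resp, [1 - x for x in resp])):
            mass = sum(r)
            if mass < 1e-9:
                continue
            means[k] = sum(ri * v for ri, v in zip(r, data)) / mass
            var = sum(ri * (v - means[k]) ** 2 for ri, v in zip(r, data)) / mass
            stds[k] = max(math.sqrt(var), 1e-6)
            weights[k] = mass / n
    return sorted(zip(means, stds))


def adaptive_cover(values, threshold=0.752, overlap=0.2,
                   min_points=8, max_depth=10):
    """Split [min, max] into overlapping intervals until each passes the test."""
    def split(lo, hi, depth):
        inside = [v for v in values if lo <= v <= hi]
        if depth >= max_depth or len(inside) < min_points:
            return [(lo, hi)]
        if anderson_darling(inside) <= threshold:
            return [(lo, hi)]  # looks Gaussian enough: keep the interval
        (m1, s1), (m2, s2) = fit_two_gaussians(inside)
        # Assumed rule: cut equally many standard deviations from each mean.
        cut = (m1 * s2 + m2 * s1) / (s1 + s2)
        if not lo < cut < hi:
            return [(lo, hi)]
        delta = overlap * min(cut - lo, hi - cut)  # children stay inside parent
        return (split(lo, cut + delta, depth + 1)
                + split(cut - delta, hi, depth + 1))

    return split(min(values), max(values), 0)


if __name__ == "__main__":
    # Two well-separated groups of ideal normal quantiles as toy filter values.
    quantiles = [NormalDist().inv_cdf((i + 0.5) / 100) for i in range(100)]
    print(adaptive_cover([q - 5 for q in quantiles] + [q + 5 for q in quantiles]))
```

The default `threshold` is a tunable assumption rather than a value taken from the paper. The resulting intervals would then serve as the cover of the filter range in a standard Mapper pipeline.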