Unsupervised data representation and visualization using tools from topology is an active and growing field within Topological Data Analysis (TDA) and data science. Its most prominent line of work is based on the so-called Mapper graph, a combinatorial graph whose topological structures (connected components, branches, loops) are in correspondence with those of the data itself. While highly generic and widely applicable, its use has so far been hampered by the manual tuning of its many parameters. A crucial one among these is the so-called filter: a continuous function whose variations on the data set are the main ingredient both for building the Mapper representation and for assessing the presence and sizes of its topological structures. However, while a few tuning methods have already been investigated for the other Mapper parameters (i.e., resolution, gain, clustering), there is currently no method for tuning the filter itself. In this work, we build on a recently proposed optimization framework incorporating topology to provide the first filter optimization scheme for Mapper graphs. To achieve this, we propose a relaxed and more general version of the Mapper graph and investigate its convergence properties. Finally, we demonstrate the usefulness of our approach by optimizing Mapper graph representations on several datasets and showing the superiority of the optimized representations over arbitrary ones.
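To make the roles of the filter, resolution, gain and clustering concrete, the following is a minimal sketch of the classical Mapper construction. It does not implement the relaxed Mapper or the filter optimization proposed in this work. All function names are hypothetical, the gain is assumed to be the overlap fraction between consecutive intervals, and connected components of an `eps`-neighbourhood graph stand in for the clustering step.

```python
import math
from itertools import combinations


def interval_cover(lo, hi, resolution, gain):
    """Cover [lo, hi] with `resolution` intervals of equal length.

    Assumption: `gain` is the fraction of its length that each interval
    shares with the next one.
    """
    length = (hi - lo) / (resolution - (resolution - 1) * gain)
    step = length * (1 - gain)
    return [(lo + i * step, lo + i * step + length) for i in range(resolution)]


def components(indices, points, eps):
    """Stand-in clustering: connected components of the eps-neighbourhood graph."""
    remaining = set(indices)
    clusters = []
    while remaining:
        seed = remaining.pop()
        stack, cluster = [seed], {seed}
        while stack:
            i = stack.pop()
            near = {j for j in remaining
                    if math.dist(points[i], points[j]) <= eps}
            remaining -= near
            cluster |= near
            stack.extend(near)
        clusters.append(frozenset(cluster))
    return clusters


def mapper_graph(points, filter_fn, resolution, gain, eps, tol=1e-9):
    """Return (nodes, edges): nodes are clusters of point indices, and two
    nodes are joined by an edge when they share at least one point."""
    values = [filter_fn(p) for p in points]
    nodes = []
    for a, b in interval_cover(min(values), max(values), resolution, gain):
        # Preimage of the interval under the filter (small tolerance at the ends).
        inside = [i for i, v in enumerate(values) if a - tol <= v <= b + tol]
        nodes.extend(components(inside, points, eps))
    edges = {(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]}
    return nodes, edges


def cycle_rank(n_nodes, edges):
    """Number of independent loops: E - V + (number of connected components)."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    n_components = len({find(x) for x in range(n_nodes)})
    return len(edges) - n_nodes + n_components


# Toy data: 60 points on the unit circle, filtered by their x-coordinate.
n = 60
circle = [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
          for k in range(n)]
nodes, edges = mapper_graph(circle, filter_fn=lambda p: p[0],
                            resolution=4, gain=0.3, eps=0.3)
print(len(nodes), len(edges), cycle_rank(len(nodes), edges))
```

On this toy circle the sketch produces a graph with a single independent loop, matching the topology of the data. Changing `filter_fn` changes which preimages are clustered and can therefore change the graph, which is why the choice of filter matters.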