Understanding the global organization of complex, high-dimensional data is of primary interest in many branches of applied science. This is typically achieved by applying dimensionality reduction techniques that map the data into a lower-dimensional space. While this family of methods preserves local structures and features, it often misses the global structure of the dataset. Clustering techniques are another class of methods, operating on the data in the ambient space. They group together points that are similar according to a fixed similarity criterion; however, unlike dimensionality reduction techniques, they do not provide information about the global organization of the data. Leveraging ideas from Topological Data Analysis, in this paper we provide an additional layer on top of the output of any clustering algorithm. This data structure, ClusterGraph, provides information about the global layout of the clusters obtained from the chosen clustering algorithm. Appropriate measures are provided to assess the quality and usefulness of the resulting representation. Subsequently, the ClusterGraph, possibly after an appropriate structure-preserving simplification, can be visualized and used in synergy with state-of-the-art exploratory data analysis techniques.