Clustering is a common problem in many areas of data science, with applications in biology and bioinformatics, chemical structure analysis, image segmentation, recommender systems, and many other fields. While there are many clustering variants (based on a given distance or graph structure, probability distributions, or data density), we consider here the problem of clustering nodes in a graph, motivated by the problem of aggregating discrete degrees of freedom in multigrid and domain decomposition methods for solving sparse linear systems. Specifically, we consider the challenge of forming balanced clusters in the graph of a sparse matrix for use in algebraic multigrid, although the algorithm has general applicability. Based on an extension of the Bellman-Ford algorithm, we generalize Lloyd's algorithm for partitioning subsets of R^n to balance the number of nodes in each cluster; this is accompanied by a rebalancing algorithm that reduces the overall energy in the system. The algorithm provides control over the number of clusters and leads to "well-centered" partitions of the graph. Theoretical results are provided to establish linear complexity, and numerical results in the context of algebraic multigrid highlight the benefits of improved clustering.
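To make the building blocks named above concrete, the following is a minimal sketch of plain (unbalanced) Lloyd clustering on a graph, with a multi-source Bellman-Ford sweep standing in for the Euclidean distance computation. It is not the balanced variant or the rebalancing step described in the abstract. The function names `bellman_ford` and `lloyd_cluster`, the adjacency-list format, and the center update (move each center to the member farthest from the cluster boundary) are illustrative assumptions, and the graph is assumed to be connected with positive edge weights.

```python
from math import inf


def bellman_ford(adj, sources):
    """Multi-source Bellman-Ford (hypothetical helper).

    adj     : dict node -> list of (neighbor, positive edge weight)
    sources : dict source node -> label carried by that source
    Returns (dist, label): distance to, and label of, the nearest source.
    """
    dist = {v: inf for v in adj}
    label = {v: None for v in adj}
    for s, lab in sources.items():
        dist[s], label[s] = 0.0, lab
    changed = True
    while changed:  # sweep over all edges until no distance improves
        changed = False
        for u in adj:
            for v, w in adj[u]:
                if dist[u] + w < dist[v]:
                    dist[v], label[v] = dist[u] + w, label[u]
                    changed = True
    return dist, label


def lloyd_cluster(adj, seeds, max_iter=10):
    """Plain Lloyd iteration on a graph (sketch; no balancing)."""
    centers = list(seeds)
    cluster = {}
    for _ in range(max_iter):
        # Assignment step: each node joins the cluster of its nearest center.
        _, cluster = bellman_ford(adj, {c: i for i, c in enumerate(centers)})
        # Boundary nodes have at least one neighbor in a different cluster.
        boundary = {u: cluster[u] for u in adj
                    if any(cluster[v] != cluster[u] for v, _w in adj[u])}
        # Distance of every node to the nearest boundary node.
        bdist, _ = bellman_ford(adj, boundary)
        # Update step (assumed heuristic): move each center to the member
        # farthest from the boundary, preferring the current center on ties.
        new_centers = []
        for i, c in enumerate(centers):
            members = [u for u in adj if cluster[u] == i]
            new_centers.append(max(members, key=lambda u: (bdist[u], u == c)))
        if new_centers == centers:
            break
        centers = new_centers
    return centers, cluster


# Usage: a path graph 0-1-2-3-4-5 with unit weights and two poorly placed seeds.
adj = {i: [] for i in range(6)}
for i in range(5):
    adj[i].append((i + 1, 1.0))
    adj[i + 1].append((i, 1.0))
centers, cluster = lloyd_cluster(adj, seeds=[0, 1])
sizes = [sum(1 for u in cluster if cluster[u] == i) for i in range(len(centers))]
```

On this small path the plain iteration happens to produce equal-sized clusters, but nothing in the sketch enforces that in general; cluster sizes depend on seed placement and graph structure, which is the gap a balanced variant is meant to address.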