Given all pairwise weights (distances) among a set of objects, a filtered graph provides a sparse representation by keeping only an important subset of those weights. Such graphs can be passed to graph clustering algorithms to generate hierarchical clusters. In particular, the directed bubble hierarchical tree (DBHT) algorithm on filtered graphs has been shown to produce good hierarchical clusters for time series data. We propose a new parallel algorithm for constructing triangulated maximally filtered graphs (TMFGs), which produces valid inputs for DBHT, and a scalable parallel algorithm for generating DBHTs that is optimized for TMFG inputs. In addition to parallelizing the original TMFG construction, which has limited parallelism, we design a new algorithm that inserts multiple vertices in each round to expose more parallelism. We show that the graphs generated by our new algorithm are similar in quality to the original TMFGs while being much faster to generate. Our new parallel algorithms for TMFGs and DBHTs are 136 to 2483x faster than state-of-the-art implementations, achieve up to 41.56x self-relative speedup on 48 cores with hyper-threading, and produce better clustering results than the standard average-linkage and complete-linkage hierarchical clustering algorithms. On a stock data set, our algorithms produce clusters that align well with human experts' classification.
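To make the idea of inserting several vertices per round concrete, the following is a minimal sequential sketch in Python, not the implementation described above. It assumes the commonly described greedy TMFG construction: seed with a 4-clique, then repeatedly place a remaining vertex inside a triangular face, adding three edges and splitting that face into three. The function name `tmfg`, the `prefix` parameter, the seeding heuristic, and the conflict rule (each vertex goes to the highest-gain face that wants it) are illustrative assumptions. With `prefix=1` the sketch performs one insertion per round; larger values batch several insertions into distinct faces, which is the step a parallel version could execute concurrently.

```python
from itertools import combinations


def tmfg(sim, prefix=1):
    """Greedy TMFG sketch on a symmetric similarity matrix (list of lists).

    Returns the edge set as frozensets {u, v}. `prefix` is a hypothetical
    parameter: the maximum number of vertices inserted per round.
    """
    n = len(sim)
    if n < 4:
        raise ValueError("need at least 4 vertices")
    if prefix < 1:
        raise ValueError("prefix must be at least 1")

    # Assumed seeding heuristic: the four vertices with the largest
    # total similarity to all other vertices form the initial 4-clique.
    totals = [sum(row) - row[i] for i, row in enumerate(sim)]
    seed = sorted(range(n), key=lambda v: totals[v], reverse=True)[:4]

    edges = {frozenset(p) for p in combinations(seed, 2)}
    faces = [tuple(f) for f in combinations(seed, 3)]
    remaining = set(range(n)) - set(seed)

    def gain(v, face):
        # Total similarity added by connecting v to the three face corners.
        return sum(sim[v][u] for u in face)

    while remaining:
        # For every face, find its best remaining vertex.
        candidates = []
        for f in faces:
            v = max(remaining, key=lambda u: gain(u, f))
            candidates.append((gain(v, f), v, f))
        candidates.sort(key=lambda t: t[0], reverse=True)

        # Keep up to `prefix` pairs with distinct vertices; faces are
        # already distinct because each face contributes one candidate.
        chosen, used = [], set()
        for _, v, f in candidates:
            if v not in used:
                used.add(v)
                chosen.append((v, f))
            if len(chosen) == prefix:
                break

        # These insertions touch disjoint faces, so they are independent.
        for v, (a, b, c) in chosen:
            edges.update(frozenset((v, u)) for u in (a, b, c))
            faces.remove((a, b, c))
            faces.extend([(a, b, v), (a, c, v), (b, c, v)])
            remaining.remove(v)

    return edges
```

Because every insertion adds exactly three edges to a planar triangulation, the result always has 3n - 6 edges for n vertices, whatever the value of `prefix`. Only the choice of which edges are kept can differ between the one-at-a-time and batched variants.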