We introduce TeraHAC, a $(1+\epsilon)$-approximate hierarchical agglomerative clustering (HAC) algorithm that scales to trillion-edge graphs. Our algorithm is based on a new approach to computing $(1+\epsilon)$-approximate HAC that combines the nearest-neighbor chain algorithm with the notion of $(1+\epsilon)$-approximate HAC. This approach allows us to partition the graph among multiple machines and make significant progress in computing the clustering within each partition before any communication with other partitions is needed. We evaluate TeraHAC on a number of real-world and synthetic graphs of up to 8 trillion edges. We show that TeraHAC requires over 100x fewer rounds than previously known approaches for computing HAC. It is up to 8.3x faster than SCC, the state-of-the-art distributed algorithm for hierarchical clustering, while achieving 1.16x higher quality. In fact, TeraHAC essentially retains the quality of the celebrated HAC algorithm while significantly reducing the running time.
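To make the notion of $(1+\epsilon)$-approximate HAC concrete, the following is a minimal sequential sketch, not the TeraHAC algorithm itself. It assumes a common formulation for similarity graphs: a merge is admissible when the similarity of the two clusters is at least $W_{\max}/(1+\epsilon)$, where $W_{\max}$ is the largest similarity currently in the graph. Average linkage, the function name `approx_hac`, and the edge-list input format are assumptions made for illustration; the sketch does not implement graph partitioning or the nearest-neighbor chain technique.

```python
def approx_hac(n, edges, eps):
    """Sequential (1+eps)-approximate HAC sketch on a similarity graph.

    n     -- number of vertices, labeled 0..n-1
    edges -- iterable of (u, v, similarity) tuples
    eps   -- approximation parameter; eps = 0 gives exact HAC

    Returns a list of (a, b, similarity, w_max) tuples, one per merge.
    Average linkage is an assumption of this sketch.
    """
    size = {v: 1 for v in range(n)}
    # Total edge weight between each pair of adjacent clusters.
    weight = {}
    for u, v, w in edges:
        key = (min(u, v), max(u, v))
        weight[key] = weight.get(key, 0.0) + w

    def sim(key):
        # Average-linkage similarity of two adjacent clusters.
        a, b = key
        return weight[key] / (size[a] * size[b])

    merges = []
    next_id = n
    while weight:
        w_max = max(sim(k) for k in weight)
        threshold = w_max / (1.0 + eps)
        # Any pair at or above the threshold is admissible; take the first.
        a, b = next(k for k in weight if sim(k) >= threshold)
        merges.append((a, b, sim((a, b)), w_max))

        # Contract a and b into a new cluster c and rebuild the edge weights.
        c = next_id
        next_id += 1
        new_weight = {}
        for (x, y), w in weight.items():
            if (x, y) == (a, b):
                continue
            x2 = c if x in (a, b) else x
            y2 = c if y in (a, b) else y
            key = (min(x2, y2), max(x2, y2))
            new_weight[key] = new_weight.get(key, 0.0) + w
        size[c] = size.pop(a) + size.pop(b)
        weight = new_weight
    return merges


# Hypothetical 4-vertex path graph used only for illustration.
example_edges = [(1, 2, 0.95), (0, 1, 1.0), (2, 3, 0.2)]
approx_merges = approx_hac(4, example_edges, eps=0.1)
exact_merges = approx_hac(4, example_edges, eps=0.0)
```

With `eps=0.1`, the first merge in this example joins vertices 1 and 2 (similarity 0.95) even though the edge between 0 and 1 has the maximum similarity 1.0, because 0.95 is above the threshold 1.0/1.1. This slack is what lets an approximate algorithm perform merges that are good locally without first identifying the globally best one.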