We introduce TeraHAC, a $(1+\epsilon)$-approximate hierarchical agglomerative clustering (HAC) algorithm that scales to trillion-edge graphs. The algorithm rests on a new approach to computing $(1+\epsilon)$-approximate HAC that combines the nearest-neighbor chain algorithm with the notion of $(1+\epsilon)$-approximate HAC. This approach lets us partition the graph among multiple machines and make significant progress on the clustering within each partition before any communication with other partitions is needed. We evaluate TeraHAC on a number of real-world and synthetic graphs with up to 8 trillion edges. TeraHAC requires over 100x fewer rounds than previously known approaches for computing HAC. It is up to 8.3x faster than SCC, the state-of-the-art distributed algorithm for hierarchical clustering, while achieving 1.16x higher quality. In fact, TeraHAC essentially retains the quality of the celebrated HAC algorithm while significantly improving the running time.
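To make the notion of $(1+\epsilon)$-approximate HAC concrete, the following is a minimal single-machine sketch, not TeraHAC itself. It assumes the commonly used definition in which a merge of similarity $w$ is acceptable whenever $w \ge w_{\max}/(1+\epsilon)$, where $w_{\max}$ is the largest similarity currently available, and it assumes average linkage on a weighted similarity graph without self-loops. The function name `approx_hac` and the input format are hypothetical, chosen only for illustration.

```python
def approx_hac(edges, eps):
    """Toy (1+eps)-approximate average-linkage HAC on a similarity graph.

    edges: dict mapping (u, v) node pairs to a similarity weight.
    Returns the merges as (kept_cluster, absorbed_cluster, similarity).
    Assumption: a merge is acceptable if its similarity is at least
    w_max / (1 + eps), where w_max is the current best similarity.
    """
    size = {}    # cluster id -> number of original nodes in the cluster
    weight = {}  # frozenset({a, b}) -> total edge weight between a and b
    for (u, v), w in edges.items():
        size[u] = 1
        size[v] = 1
        key = frozenset((u, v))
        weight[key] = weight.get(key, 0.0) + w

    def sim(pair):
        # Average linkage: total cross weight over the number of node pairs.
        a, b = tuple(pair)
        return weight[pair] / (size[a] * size[b])

    merges = []
    while weight:
        w_max = max(sim(p) for p in weight)
        # Take the first acceptable pair in a fixed order. It need not be
        # the best pair, which is the freedom the approximation grants.
        pair = next(p for p in sorted(weight, key=sorted)
                    if sim(p) >= w_max / (1 + eps))
        a, b = sorted(pair)
        merges.append((a, b, sim(pair)))
        del weight[pair]
        # Redirect every edge incident to b so that it points to a.
        for p in [p for p in weight if b in p]:
            (c,) = p - {b}
            w = weight.pop(p)
            key = frozenset((a, c))
            weight[key] = weight.get(key, 0.0) + w
        size[a] += size.pop(b)
    return merges


edges = {(0, 1): 0.95, (1, 2): 1.0}
exact = approx_hac(edges, eps=0.0)   # must merge the best pair (1, 2) first
approx = approx_hac(edges, eps=0.1)  # may merge (0, 1) first: 0.95 >= 1.0 / 1.1
```

With `eps=0.0` the sketch reduces to exact HAC and always merges the most similar pair. With `eps=0.1` it is free to merge any pair within a factor of $1.1$ of the best. That slack is what allows many merges to proceed without knowing the global maximum, although how TeraHAC exploits it across partitions is not shown here.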