We consider the problem of maintaining a hierarchical agglomerative clustering (HAC) in the dynamic setting, where the input is subject to point insertions and deletions. We introduce DynHAC, the first dynamic HAC algorithm for the popular average-linkage version of the problem that can maintain a (1+\epsilon)-approximate solution. Our approach leverages recent structural results on (1+\epsilon)-approximate HAC to carefully identify the part of the clustering dendrogram that needs to be updated, so that the maintained solution is consistent with what a full recomputation from scratch would have produced. We evaluate DynHAC on a number of real-world graphs and show that it handles each update up to 423x faster than recomputing the clustering from scratch. At the same time, it achieves a normalized mutual information (NMI) score up to 0.21 higher than state-of-the-art dynamic hierarchical clustering algorithms, which do not provably approximate HAC.
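To make the (1+\epsilon)-approximation notion concrete, here is a minimal static sketch in Python of the from-scratch baseline. It is not DynHAC and not the authors' implementation. It assumes the usual graph-based definitions: the average-linkage similarity of two clusters is the total edge weight between them divided by the product of their sizes, and a merge is admissible when its similarity is within a factor (1+\epsilon) of the current best. The function name, input format, and tie-breaking rule are hypothetical choices made for illustration.

```python
def approx_average_linkage_hac(n, edges, epsilon=0.1):
    """Naive static (1+epsilon)-approximate average-linkage HAC (illustrative only).

    n       -- number of nodes, labelled 0..n-1 (assumed input format)
    edges   -- dict mapping (u, v) to a similarity weight
    Returns a list of merges (a, b, new_id, similarity) in merge order.
    """
    size = {i: 1 for i in range(n)}
    # cut[a][b] = total edge weight between clusters a and b
    cut = {i: {} for i in range(n)}
    for (u, v), w in edges.items():
        if u == v:
            continue  # self-loops do not affect inter-cluster similarity
        cut[u][v] = cut[u].get(v, 0.0) + w
        cut[v][u] = cut[v].get(u, 0.0) + w

    merges = []
    next_id = n
    while True:
        # Average-linkage similarity for every pair of adjacent clusters.
        pairs = [
            (cut[a][b] / (size[a] * size[b]), a, b)
            for a in cut for b in cut[a] if a < b
        ]
        if not pairs:
            break  # no edges left between distinct clusters
        best = max(p[0] for p in pairs)
        # Any pair within a (1+epsilon) factor of the best is admissible;
        # break ties by smallest ids so the output is deterministic.
        admissible = [p for p in pairs if p[0] * (1 + epsilon) >= best]
        s, a, b = min(admissible, key=lambda p: (p[1], p[2]))

        c = next_id
        next_id += 1
        size[c] = size[a] + size[b]
        cut[c] = {}
        for old in (a, b):
            for nb, w in cut.pop(old).items():
                if nb in (a, b):
                    continue  # the a-b edge becomes internal to c
                cut[c][nb] = cut[c].get(nb, 0.0) + w
                del cut[nb][old]
                cut[nb][c] = cut[c][nb]
        del size[a], size[b]
        merges.append((a, b, c, s))
    return merges
```

Calling `approx_average_linkage_hac(4, {(0, 1): 1.0, (2, 3): 0.9, (1, 2): 0.1}, epsilon=0.0)` performs exact HAC on a four-node path. The sketch rescans all cluster pairs at every step, so it only illustrates the definition; it makes no attempt at the efficiency of static approximate HAC algorithms, let alone the dynamic update strategy described in the abstract.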