Sublinear Algorithms for Estimating Single-Linkage Clustering Costs

Single-linkage clustering is a fundamental method for data analysis. Algorithmically, one can compute a single-linkage $k$-clustering (a partition into $k$ clusters) by computing a minimum spanning tree and dropping the $k-1$ most costly edges. This clustering minimizes the sum of the spanning tree weights of the clusters. This motivates us to define the cost of a single-linkage $k$-clustering as the weight of the corresponding spanning forest, denoted by $\mathrm{cost}_k$. Moreover, if we view single-linkage clustering as computing a hierarchy of clusterings, the total cost of the hierarchy is defined as the sum of the costs of the individual clusterings, denoted by $\mathrm{cost}(G) = \sum_{k=1}^{n} \mathrm{cost}_k$. In this paper, we assume that the distances between data points are given as a graph $G$ with average degree $d$ and edge weights from $\{1,\dots, W\}$. Given query access to the adjacency list of $G$, we present a sampling-based algorithm that computes a succinct representation of estimates $\widehat{\mathrm{cost}}_k$ for all $k$. The running time is $\tilde O(d\sqrt{W}/\varepsilon^3)$, and the estimates satisfy $\sum_{k=1}^{n} |\widehat{\mathrm{cost}}_k - \mathrm{cost}_k| \le \varepsilon\cdot \mathrm{cost}(G)$ for any $0<\varepsilon <1$. Thus we can approximate the cost of every $k$-clustering up to a $(1+\varepsilon)$ factor \emph{on average}. In particular, our result ensures that we can estimate $\mathrm{cost}(G)$ up to a factor of $1\pm \varepsilon$ in the same running time. We also extend our results to the setting where edges represent similarities. In this case, the clusterings are defined by a maximum spanning tree, and our algorithms run in $\tilde{O}(dW/\varepsilon^3)$ time. We further prove nearly matching lower bounds for estimating the total clustering cost, and we extend our algorithms to metric space settings.
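
To make the definitions of $\mathrm{cost}_k$ and $\mathrm{cost}(G)$ concrete, the following is a minimal sketch of the exact (not sublinear) computation that the estimates are measured against. It is not the sampling-based algorithm of the paper: it reads the whole graph, runs Kruskal's algorithm, and uses the fact that dropping the $k-1$ most costly tree edges leaves the $n-k$ lightest ones. The function names `mst_weights` and `single_linkage_costs`, the edge-list input format, and the assumption that $G$ is connected are illustrative choices, not taken from the paper.

```python
def mst_weights(n, edges):
    """Return the MST edge weights of a connected graph in ascending order.

    n     -- number of vertices, labelled 0..n-1 (assumed labelling)
    edges -- iterable of (u, v, w) with integer weight w (assumed format)
    """
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    weights = []
    # Kruskal: scan edges by increasing weight, keep those joining two components.
    for w, u, v in sorted((w, u, v) for u, v, w in edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            weights.append(w)
    if len(weights) != n - 1:
        raise ValueError("graph is assumed to be connected")
    return weights


def single_linkage_costs(n, edges):
    """Return {k: cost_k} for k = 1..n.

    cost_k is the weight of the spanning forest obtained from the MST
    by dropping its k-1 most costly edges.
    """
    weights = mst_weights(n, edges)
    costs = {n: 0}  # n singleton clusters: empty forest
    running = 0
    for i, w in enumerate(weights, start=1):
        running += w
        costs[n - i] = running  # keeping the i lightest edges leaves n-i clusters
    return costs


# Small example: a path 0-1-2-3 with weights 1, 2, 3 plus a heavier edge (0, 3).
example_edges = [(0, 1, 1), (1, 2, 2), (2, 3, 3), (0, 3, 4)]
costs = single_linkage_costs(4, example_edges)
total_cost = sum(costs.values())  # cost(G) = sum over k of cost_k
```

On this example the MST uses the edges of weight 1, 2, and 3, so $\mathrm{cost}_1 = 6$, $\mathrm{cost}_2 = 3$, $\mathrm{cost}_3 = 1$, $\mathrm{cost}_4 = 0$, and $\mathrm{cost}(G) = 10$. An estimator with the guarantee stated above would have to return values $\widehat{\mathrm{cost}}_k$ whose absolute errors sum to at most $10\varepsilon$.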
