Consensus clustering (or clustering aggregation) takes as input $k$ partitions of a given ground set $V$ and seeks a single partition that minimizes disagreement with all input partitions. State-of-the-art algorithms for consensus clustering are based on correlation clustering methods such as the popular Pivot algorithm. Unfortunately, these methods have not proved practical for consensus clustering instances where either $k$ or $|V|$ is large. In this paper we provide practical running-time improvements for correlation clustering solvers when $|V|$ is large. We reduce the time complexity of Pivot from $O(|V|^2 k)$ to $O(|V| k)$, and its space complexity from $O(|V|^2)$ to $O(|V| k)$, a significant saving since in practice $k$ is much smaller than $|V|$. We also analyze a sampling method for these algorithms when $k$ is large, bridging the gap between running Pivot on the full set of input partitions (an expected 1.57-approximation) and choosing a single input partition at random (an expected 2-approximation). We show experimentally that algorithms like Pivot obtain high-quality clusterings in practice even on small samples of input partitions.
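To make the setting concrete, the following is a minimal sketch of Pivot for consensus clustering that keeps only a length-$k$ label vector per element instead of a $|V| \times |V|$ pairwise matrix, which is one way to obtain the $O(|V| k)$ space bound. It is an illustration under stated assumptions, not the paper's algorithm: the function name `pivot_consensus`, the tie rule (an element joins the pivot when at least half of the partitions place them together), and sampling partitions uniformly without replacement are all hypothetical choices, and this naive loop does not achieve the $O(|V| k)$ running time claimed above.

```python
import random


def pivot_consensus(partitions, sample_size=None, rng=None):
    """Pivot on an implicit consensus-clustering instance (illustrative sketch).

    partitions: list of k lists; partitions[i][v] is the cluster label of
                element v in input partition i.
    sample_size: if given, run on that many partitions sampled uniformly
                 without replacement (an assumed sampling scheme).
    Returns a list mapping each element to a consensus cluster id.
    """
    rng = rng or random.Random()
    if sample_size is not None:
        partitions = rng.sample(partitions, sample_size)
    k = len(partitions)
    n = len(partitions[0])

    # One length-k label vector per element: O(|V| k) space, no pairwise matrix.
    labels = [tuple(p[v] for p in partitions) for v in range(n)]

    order = list(range(n))
    rng.shuffle(order)  # random pivot order

    cluster = [None] * n
    next_id = 0
    for pivot in order:
        if cluster[pivot] is not None:
            continue  # already absorbed by an earlier pivot
        cluster[pivot] = next_id
        for u in range(n):
            if cluster[u] is None:
                # Number of partitions that put u and the pivot together.
                together = sum(a == b for a, b in zip(labels[pivot], labels[u]))
                if 2 * together >= k:  # assumed tie rule: majority or tie joins
                    cluster[u] = next_id
        next_id += 1
    return cluster


def disagreements(consensus, partitions):
    """Total number of element pairs on which consensus disagrees with each input."""
    n = len(consensus)
    total = 0
    for p in partitions:
        for u in range(n):
            for v in range(u + 1, n):
                if (consensus[u] == consensus[v]) != (p[u] == p[v]):
                    total += 1
    return total
```

Calling `pivot_consensus(partitions, sample_size=s)` with a small `s` corresponds to the sampling regime discussed above, and `disagreements` can be used to compare the objective value of the result against that of the full run on small instances.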