Correlation Clustering is a fundamental and widely studied problem in unsupervised learning and data mining. The input is a graph, and the goal is to construct a clustering that minimizes the number of inter-cluster edges plus the number of missing intra-cluster edges. CCL+24 introduced the cluster LP for Correlation Clustering, which they argued captures the problem much more succinctly than previous linear programming formulations. However, the cluster LP has exponential size, with a variable for every possible set of vertices in the input graph. Nevertheless, CCL+24 showed how to find a feasible solution for the cluster LP in time $O(n^{\text{poly}(1/\epsilon)})$ with objective value at most $(1+\epsilon)$ times the value of an optimal solution for the respective Correlation Clustering instance. Furthermore, they showed how to round a solution to the cluster LP, yielding a $(1.437+\epsilon)$-approximation algorithm for the Correlation Clustering problem. The main technical result of this paper is a new approach that finds a feasible solution for the cluster LP with objective value at most $(1+\epsilon)$ times the optimum in time $\widetilde O(2^{\text{poly}(1/\epsilon)} n)$, where $n$ is the number of vertices in the graph. We also show how to implement the rounding within the same time bound, thus achieving a fast $(1.437+\epsilon)$-approximation algorithm for the Correlation Clustering problem. This bridges the gap between state-of-the-art methods for approximating Correlation Clustering and the recent focus on fast algorithms.
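To make the objective concrete, here is a minimal Python sketch that evaluates the Correlation Clustering cost of a given clustering: the number of edges whose endpoints lie in different clusters plus the number of non-adjacent pairs placed in the same cluster. The function name `correlation_clustering_cost`, the convention that vertices are labeled `0..n-1`, and the example graph are illustrative assumptions, not taken from the paper or from CCL+24. It is a brute-force $O(n^2)$ cost evaluator, not the near-linear-time algorithm described above.

```python
from itertools import combinations


def correlation_clustering_cost(n, edges, labels):
    """Count the disagreements of a clustering on a graph with vertices 0..n-1.

    edges  : iterable of (u, v) pairs, the edges of the input graph
    labels : labels[v] is the cluster id assigned to vertex v
    """
    edge_set = {frozenset(e) for e in edges}
    cost = 0
    for u, v in combinations(range(n), 2):
        same_cluster = labels[u] == labels[v]
        is_edge = frozenset((u, v)) in edge_set
        # A pair disagrees if it is an edge split across clusters,
        # or a non-edge placed inside one cluster.
        if is_edge != same_cluster:
            cost += 1
    return cost


if __name__ == "__main__":
    # Hypothetical example: a triangle {0, 1, 2} with a pendant vertex 3.
    edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
    # Clustering {0, 1, 2}, {3} cuts only the edge (2, 3).
    print(correlation_clustering_cost(4, edges, [0, 0, 0, 1]))  # prints 1
```

On this example, putting all four vertices into one cluster costs 2 (the missing pairs (0, 3) and (1, 3)), while the all-singletons clustering costs 4 (every edge is cut).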