Combinatorial Correlation Clustering

Correlation Clustering is a classic clustering objective arising in numerous machine learning and data mining applications. Given a graph $G=(V,E)$, the goal is to partition the vertex set into clusters so as to minimize the number of edges between clusters plus the number of edges missing within clusters. The problem is APX-hard, and the best known polynomial-time approximation factor is 1.73, due to Cohen-Addad, Lee, Li, and Newman [FOCS'23]. They use an LP with $|V|^{1/\epsilon^{\Theta(1)}}$ variables for some small $\epsilon$. However, due to the practical relevance of correlation clustering, there has also been great interest in obtaining more efficient sequential and parallel algorithms. The classic combinatorial \emph{pivot} algorithm of Ailon, Charikar, and Newman [JACM'08] provides a 3-approximation in linear time. Like most other algorithms discussed here, it is randomized. Recently, Behnezhad, Charikar, Ma, and Tan [FOCS'22] presented a $(3+\epsilon)$-approximation for the problem in a constant number of rounds in the Massively Parallel Computation (MPC) setting. Very recently, Cao, Huang, and Su [SODA'24] provided a 2.4-approximation in a polylogarithmic number of rounds in the MPC model and in $\tilde{O}(|E|^{1.5})$ time in the classic sequential setting. They asked whether it is possible to get a better-than-3 approximation in near-linear time. We resolve this question with an efficient combinatorial algorithm that provides a drastically better approximation factor. It achieves an approximation factor of $\sim 2-2/13 < 1.847$ in sub-linear ($\tilde O(|V|)$) sequential time or in sub-linear ($\tilde O(|V|)$) space in the streaming setting. In the MPC model, we give an algorithm using only a constant number of rounds that achieves an approximation factor of $\sim 2-1/8 < 1.876$.
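
To make the objective and the classic \emph{pivot} baseline concrete, the following is a minimal Python sketch. It is not the algorithm of this paper. It only illustrates the disagreement count defined above and the textbook pivot procedure, whose 3-approximation guarantee holds in expectation. The function names `disagreements` and `pivot_clustering`, the edge-list input format, and the assumption of a simple graph without self-loops are illustrative choices, not taken from the source.

```python
import random
from itertools import combinations


def disagreements(vertices, edges, cluster_of):
    """Correlation clustering cost: edges between clusters plus
    non-adjacent pairs inside the same cluster.

    This brute-force version checks all vertex pairs (quadratic time)
    and is only meant for checking small examples.
    """
    edge_set = {frozenset(e) for e in edges}
    cost = 0
    for u, v in combinations(vertices, 2):
        same_cluster = cluster_of[u] == cluster_of[v]
        adjacent = frozenset((u, v)) in edge_set
        if same_cluster != adjacent:
            cost += 1
    return cost


def pivot_clustering(vertices, edges, rng=None):
    """Textbook pivot procedure (illustrative sketch).

    Scan the vertices in a uniformly random order. Each vertex that is
    still unclustered becomes a pivot and forms a cluster together with
    all of its still-unclustered neighbours. Returns a dict mapping each
    vertex to the pivot that labels its cluster.
    """
    rng = rng or random.Random()
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    order = list(vertices)
    rng.shuffle(order)  # random order == repeatedly picking a random pivot

    cluster_of = {}
    for p in order:
        if p in cluster_of:
            continue  # already absorbed by an earlier pivot
        cluster_of[p] = p
        for w in adj[p]:
            if w not in cluster_of:
                cluster_of[w] = p
    return cluster_of
```

On a graph made of two disjoint triangles, every pivot order recovers the two triangles with zero disagreements. On the three-vertex path 0-1-2, every pivot order yields exactly one disagreement: either the pair (0, 2) is missing inside a single cluster, or one path edge is cut.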
