Grouping similar elements in a dataset is a common task in data mining and machine learning. In this paper, we study streaming and parallel algorithms for correlation clustering, where each pair of elements is labeled either similar or dissimilar. The task is to partition the elements so as to minimize disagreements, that is, the number of dissimilar pairs that are grouped together plus the number of similar pairs that are separated. Our main contribution is a semi-streaming algorithm that achieves a $(3 + \varepsilon)$-approximation to the minimum number of disagreements using a single pass over the stream. The algorithm also works for dynamic streams. Our approach builds on the analysis of the PIVOT algorithm by Ailon, Charikar, and Newman [JACM'08], which obtains a $3$-approximation in the centralized setting. Our design allows us to sparsify the input graph by ignoring a large portion of the nodes and edges without incurring a large extra cost compared to the analysis of PIVOT. This sparsification makes our technique applicable in several models of massive graph processing, such as semi-streaming and Massively Parallel Computing (MPC), where sparse graphs can typically be handled much more efficiently. Our work improves on the approximation ratio of the recent single-pass $5$-approximation algorithm and on the number of passes of the recent $O(1/\varepsilon)$-pass $(3 + \varepsilon)$-approximation algorithm [Behnezhad, Charikar, Ma, Tan FOCS'22, SODA'23]. Our algorithm is also more robust and applies to dynamic streams. Furthermore, it is the first single-pass $(3 + \varepsilon)$-approximation algorithm that uses polynomial post-processing time.
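To make the objective and the centralized baseline concrete, the following is a minimal sketch of the classical PIVOT procedure (repeatedly pick a random unclustered node and cluster it with its unclustered similar neighbors) together with a disagreement counter. It is not the semi-streaming algorithm of this paper and performs no sparsification. The function names `pivot_clustering` and `count_disagreements`, and the representation of the input as a list of similar pairs, are assumptions made for illustration.

```python
import random
from itertools import combinations


def pivot_clustering(nodes, similar_pairs, rng=None):
    """Centralized PIVOT sketch (hypothetical helper, not the paper's code).

    nodes: iterable of hashable elements.
    similar_pairs: iterable of (u, v) tuples labeled similar;
                   every other pair is treated as dissimilar.
    """
    rng = rng or random.Random()
    order = list(nodes)
    rng.shuffle(order)  # a uniformly random pivot order

    # Adjacency lists over the "similar" pairs only.
    neighbors = {v: set() for v in order}
    for u, v in similar_pairs:
        neighbors[u].add(v)
        neighbors[v].add(u)

    clustered = set()
    clusters = []
    for pivot in order:
        if pivot in clustered:
            continue  # already absorbed by an earlier pivot
        cluster = {pivot} | (neighbors[pivot] - clustered)
        clustered |= cluster
        clusters.append(cluster)
    return clusters


def count_disagreements(nodes, similar_pairs, clusters):
    """Count dissimilar pairs grouped together plus similar pairs separated."""
    similar = {frozenset(p) for p in similar_pairs}
    label = {v: i for i, c in enumerate(clusters) for v in c}
    cost = 0
    for u, v in combinations(list(nodes), 2):
        together = label[u] == label[v]
        is_similar = frozenset((u, v)) in similar
        if together != is_similar:
            cost += 1
    return cost


if __name__ == "__main__":
    nodes = ["a", "b", "c"]
    similar_pairs = [("a", "b"), ("b", "c")]  # ("a", "c") is dissimilar
    clusters = pivot_clustering(nodes, similar_pairs, random.Random(0))
    print(clusters, count_disagreements(nodes, similar_pairs, clusters))
```

On the three-node path in the usage example, every pivot choice leaves exactly one disagreement: either the dissimilar pair ends up together, or one similar pair is split.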