Grouping similar elements in datasets is a common task in data mining and machine learning. In this paper, we study streaming and parallel algorithms for correlation clustering, where each pair of elements is labeled either similar or dissimilar. The task is to partition the elements so as to minimize disagreements, that is, the number of dissimilar pairs that are grouped together plus the number of similar pairs that are separated. Our main contribution is a semi-streaming algorithm that achieves a $(3 + \varepsilon)$-approximation to the minimum number of disagreements using a single pass over the stream. Our approach builds on the analysis of the PIVOT algorithm by Ailon, Charikar, and Newman [JACM'08], which obtains a $3$-approximation in the centralized setting. Our design allows us to sparsify the input graph by ignoring a large portion of the nodes and edges while incurring only a small extra cost relative to the analysis of PIVOT. This sparsification makes our technique applicable to several models of massive graph processing, such as semi-streaming and Massively Parallel Computing (MPC), where sparse graphs can typically be handled much more efficiently. For the semi-streaming model, our approach yields a single-pass algorithm that works in the adaptive-order setting. This improves on the approximation ratio of the recent single-pass $5$-approximation algorithm and on the number of passes of the recent $O(1/\varepsilon)$-pass $(3 + \varepsilon)$-approximation algorithm [Behnezhad, Charikar, Ma, Tan FOCS'22, SODA'23]. For linear-memory MPC, we get an $O(1)$-round algorithm whose round complexity is independent of $\varepsilon$; the parameter $\varepsilon$ appears only in the memory requirement.
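To make the objective and the PIVOT baseline concrete, the following is a minimal sketch of the classic centralized PIVOT procedure (process nodes in a uniformly random order; each still-unclustered node becomes a pivot and takes its unclustered similar neighbors) together with a disagreement counter. It illustrates only the centralized baseline, not the semi-streaming or MPC algorithm of this paper. The function names `build_similar`, `pivot`, and `disagreements`, and the adjacency-dictionary input format, are assumptions made for illustration.

```python
import random
from itertools import combinations


def build_similar(edges):
    """Build an adjacency map of 'similar' pairs; all other pairs are dissimilar."""
    similar = {}
    for u, v in edges:
        similar.setdefault(u, set()).add(v)
        similar.setdefault(v, set()).add(u)
    return similar


def pivot(nodes, similar, seed=None):
    """Classic PIVOT: visit nodes in random order; each unclustered node
    becomes a pivot and absorbs its still-unclustered similar neighbors."""
    rng = random.Random(seed)
    order = list(nodes)
    rng.shuffle(order)
    cluster_of = {}  # node -> pivot that identifies its cluster
    for v in order:
        if v in cluster_of:
            continue  # already absorbed by an earlier pivot
        cluster_of[v] = v
        for u in similar.get(v, ()):
            if u not in cluster_of:
                cluster_of[u] = v
    return cluster_of


def disagreements(nodes, similar, cluster_of):
    """Count dissimilar pairs placed together plus similar pairs separated."""
    cost = 0
    for u, v in combinations(list(nodes), 2):
        together = cluster_of[u] == cluster_of[v]
        is_similar = v in similar.get(u, ())
        if together != is_similar:
            cost += 1
    return cost


if __name__ == "__main__":
    # Path 0 - 1 - 2: pairs (0,1) and (1,2) are similar, (0,2) is dissimilar.
    sim = build_similar([(0, 1), (1, 2)])
    clustering = pivot(range(3), sim, seed=0)
    print(clustering, disagreements(range(3), sim, clustering))
```

On the three-node path in the usage example, every clustering has at least one disagreement, and PIVOT incurs exactly one regardless of the random order. The quadratic pair enumeration in `disagreements` is for clarity only and would not be used on massive graphs.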