We study the correlation clustering problem in the node-arrival data stream model. Whereas previous work assumes a stream of the graph's edges, we focus on the setting in which the stream contains only the nodes. This model better reflects many real-world scenarios in which the data stream naturally consists of raw objects (e.g., images, tweets) and the similar/dissimilar edges are derived through a similarity function. We present C$^4$Approx, a streaming algorithm that approximates the cost of correlation clustering using space sublinear in the number of nodes and a constant number of passes. We complement this result with lower bounds. Experiments on real-world datasets show that, while storing only 2% of the nodes, our algorithm achieves performance comparable to the classic Pivot algorithm and the more recent PrunedPivot algorithm, even on sparse graphs.
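To make the objects in this abstract concrete, the following is a minimal Python sketch of the classic Pivot baseline and of the correlation clustering cost (the number of disagreements) when edges are derived from a similarity function over raw objects. It is not C$^4$Approx, and it is not a streaming algorithm: it holds all nodes in memory. The function names, the `similar` predicate, and the toy data are assumptions introduced only for illustration.

```python
import random
from itertools import combinations


def pivot_clustering(nodes, similar, rng):
    """Classic Pivot (illustrative): visit nodes in random order; each
    still-unclustered node becomes a pivot and absorbs every unclustered
    node that the similarity function marks as similar to it."""
    order = list(nodes)
    rng.shuffle(order)
    cluster_of = {}  # node -> pivot that represents its cluster
    for pivot in order:
        if pivot in cluster_of:
            continue  # already absorbed by an earlier pivot
        cluster_of[pivot] = pivot
        for v in order:
            if v not in cluster_of and similar(pivot, v):
                cluster_of[v] = pivot
    return cluster_of


def clustering_cost(nodes, similar, cluster_of):
    """Disagreements: similar pairs placed in different clusters plus
    dissimilar pairs placed in the same cluster."""
    cost = 0
    for u, v in combinations(nodes, 2):
        same_cluster = cluster_of[u] == cluster_of[v]
        if similar(u, v) != same_cluster:
            cost += 1
    return cost


if __name__ == "__main__":
    # Hypothetical toy data: numbers stand in for raw objects, and two
    # objects are "similar" when they differ by at most 2.
    nodes = [0, 1, 2, 10, 11, 12]
    similar = lambda a, b: abs(a - b) <= 2
    clusters = pivot_clustering(nodes, similar, random.Random(0))
    print(clusters, clustering_cost(nodes, similar, clusters))
```

Note that `clustering_cost` evaluates the similarity function on every pair of nodes, which requires all nodes to be stored; approximating this quantity while keeping only a small fraction of the nodes is the problem the abstract addresses.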