Local Graph Clustering with Noisy Labels

The growing interest in machine learning problems over graphs with additional node information such as texts, images, or labels has popularized methods that require the costly operation of processing the entire graph. Yet little effort has been devoted to developing fast local methods (i.e., methods that do not access the entire graph) that extract useful information from such data. To that end, we propose a study of local graph clustering using noisy node labels as a proxy for additional node information. In this setting, nodes receive initial binary labels based on cluster affiliation: 1 if they belong to the target cluster and 0 otherwise. Subsequently, a fraction of these labels is flipped. We investigate the benefits of incorporating noisy labels for local graph clustering. By constructing a weighted graph with such labels, we study the performance of a graph diffusion-based local clustering method on both the original and the weighted graphs. From a theoretical perspective, we consider recovering an unknown target cluster with a single seed node in a random graph with independent noisy node labels. We provide sufficient conditions on the label noise under which, with high probability, using diffusion in the weighted graph yields a more accurate recovery of the target cluster. This approach proves more effective than using the given labels alone or using diffusion in the label-free original graph. Empirically, we show that reliable node labels can be obtained with just a few samples from an attributed graph. Moreover, utilizing these labels via diffusion in the weighted graph leads to significantly better local clustering performance across several real-world datasets, improving F1 scores by up to 13%.
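The setting described above can be illustrated with a minimal sketch. The abstract does not state how edge weights are derived from the labels or which diffusion is used, so everything specific here is an assumption made for illustration: the function names, the rule that edges keep weight 1 only when both endpoints carry label 1 and are shrunk to a hypothetical `eps` otherwise, and the use of personalized PageRank as a stand-in for the diffusion method.

```python
import numpy as np


def flip_labels(true_labels, flip_prob, rng):
    """Noise model from the abstract: flip each binary label independently."""
    flips = rng.random(true_labels.shape[0]) < flip_prob
    return np.where(flips, 1 - true_labels, true_labels)


def label_weighted_adjacency(adj, noisy_labels, eps=0.05):
    """Reweight edges using labels (ASSUMED scheme, not taken from the paper).

    An edge keeps weight 1 when both endpoints are labeled 1;
    every other edge is shrunk to eps.
    """
    both_one = np.outer(noisy_labels, noisy_labels).astype(float)
    return adj * (both_one + eps * (1.0 - both_one))


def personalized_pagerank(weights, seed, alpha=0.15, iters=200):
    """Seeded diffusion by power iteration (a stand-in for the paper's method)."""
    n = weights.shape[0]
    degrees = weights.sum(axis=1)
    safe_degrees = np.where(degrees > 0, degrees, 1.0)  # avoid division by zero
    transition = weights / safe_degrees[:, None]  # row-stochastic random walk
    restart = np.zeros(n)
    restart[seed] = 1.0
    scores = restart.copy()
    for _ in range(iters):
        scores = alpha * restart + (1.0 - alpha) * (transition.T @ scores)
    return scores


# Toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
adj = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[u, v] = adj[v, u] = 1.0

true_labels = np.array([1, 1, 1, 0, 0, 0])  # target cluster is {0, 1, 2}
noisy_labels = flip_labels(true_labels, flip_prob=0.0, rng=np.random.default_rng(0))

plain_scores = personalized_pagerank(adj, seed=0)
weighted_scores = personalized_pagerank(
    label_weighted_adjacency(adj, noisy_labels), seed=0
)
print("mass on target cluster, original graph:", plain_scores[:3].sum())
print("mass on target cluster, weighted graph:", weighted_scores[:3].sum())
```

In this toy example the labels are left unflipped (`flip_prob=0.0`) so the effect of the reweighting is easy to see: the bridge edge 2-3 is down-weighted, and less diffusion mass leaks out of the target cluster. Raising `flip_prob` shows how label noise erodes that advantage under this assumed weighting.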
