The growing interest in machine learning problems over graphs with additional node information such as texts, images, or labels has popularized methods that require the costly operation of processing the entire graph. Yet little effort has gone into developing fast local methods (i.e., methods that do not access the entire graph) that extract useful information from such data. To that end, we propose a study of local graph clustering using noisy node labels as a proxy for additional node information. In this setting, nodes receive initial binary labels based on cluster affiliation: 1 if they belong to the target cluster and 0 otherwise. Subsequently, a fraction of these labels is flipped. We investigate the benefits of incorporating noisy labels for local graph clustering. By constructing a weighted graph from such labels, we study the performance of a graph diffusion-based local clustering method on both the original and the weighted graphs. From a theoretical perspective, we consider recovering an unknown target cluster from a single seed node in a random graph with independent noisy node labels. We provide sufficient conditions on the label noise under which, with high probability, diffusion in the weighted graph yields a more accurate recovery of the target cluster than either using the given labels alone or using diffusion in the label-free original graph. Empirically, we show that reliable node labels can be obtained with just a few samples from an attributed graph. Moreover, utilizing these labels via diffusion in the weighted graph leads to significantly better local clustering performance across several real-world datasets, improving F1 scores by up to 13%.
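The setting described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the diffusion method or the weighting rule, so the sketch swaps in approximate personalized PageRank (the push procedure of Andersen, Chung, and Lang) as the diffusion, and assumes that an edge keeps weight 1 when both endpoints carry label 1 and is down-weighted to a hypothetical `eps` otherwise. All function names, the planted-cluster graph model, and the parameter values are assumptions made for illustration. Ranking nodes by raw diffusion mass and keeping the top `k` is also a simplification that assumes the cluster size is known.

```python
import random
from collections import defaultdict

def planted_cluster_graph(n, k, p, q, rng):
    """Random graph: nodes 0..k-1 form the target cluster. Edges appear with
    probability p inside the cluster and q for every other pair."""
    adj = defaultdict(dict)
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < (p if (i < k and j < k) else q):
                adj[i][j] = adj[j][i] = 1.0
    return adj

def noisy_labels(n, k, flip_prob, rng):
    """Label 1 for cluster nodes, 0 otherwise; each label flips independently."""
    clean = [1 if i < k else 0 for i in range(n)]
    return [1 - y if rng.random() < flip_prob else y for y in clean]

def reweight(adj, labels, eps=0.1):
    """ASSUMED rule: weight 1 if both endpoints are labeled 1, else eps.
    Built eagerly here for brevity; a local method would do this lazily."""
    w = defaultdict(dict)
    for u, nbrs in adj.items():
        for v in nbrs:
            w[u][v] = 1.0 if labels[u] == 1 and labels[v] == 1 else eps
    return w

def approximate_ppr(adj, seed, alpha=0.15, tol=1e-4):
    """Push-style approximate personalized PageRank on a weighted graph.
    Only nodes reached from the seed are ever touched."""
    p, r = defaultdict(float), defaultdict(float)
    r[seed] = 1.0
    deg = {}  # weighted degrees, computed lazily to stay local

    def degree(u):
        if u not in deg:
            deg[u] = sum(adj.get(u, {}).values())
        return deg[u]

    stack = [seed]
    while stack:
        u = stack.pop()
        du = degree(u)
        if du == 0 or r[u] < tol * du:
            continue
        ru = r[u]
        p[u] += alpha * ru               # settle a share of the residual
        r[u] = (1 - alpha) * ru / 2      # lazy walk: half stays at u
        for v, w in adj[u].items():      # the other half spreads by weight
            r[v] += (1 - alpha) * ru / 2 * w / du
            if r[v] >= tol * degree(v):
                stack.append(v)
        if r[u] >= tol * du:
            stack.append(u)
    return p

def top_k_by_score(scores, k):
    """Simplified rounding: keep the k nodes with the most diffusion mass."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def f1_score(found, target):
    tp = len(found & target)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(found), tp / len(target)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    rng = random.Random(0)
    n, k = 300, 60
    adj = planted_cluster_graph(n, k, p=0.3, q=0.02, rng=rng)
    labels = noisy_labels(n, k, flip_prob=0.2, rng=rng)
    target = set(range(k))
    for name, graph in [("original", adj), ("weighted", reweight(adj, labels))]:
        scores = approximate_ppr(graph, seed=0)
        print(name, round(f1_score(top_k_by_score(scores, k), target), 3))
```

Running the script prints one F1 score for diffusion on the original graph and one for the label-weighted graph, so the two can be compared for a given noise level. The sketch makes no claim about which is larger for any particular parameter choice; the paper's sufficient conditions on the label noise are what govern that comparison.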