Scalable Community Search with Accuracy Guarantee on Attributed Graphs

Given an attributed graph $G$ and a query node $q$, \underline{C}ommunity \underline{S}earch over \underline{A}ttributed \underline{G}raphs (CS-AG) aims to find a structure- and attribute-cohesive subgraph of $G$ that contains $q$. Although CS-AG has been widely studied, existing methods still face three challenges. (1) Exact methods based on graph traversal are time-consuming, especially on large graphs. Tailored indices can improve efficiency but introduce non-negligible storage and maintenance overhead. (2) Approximate methods with a loose approximation ratio provide only a coarse-grained evaluation of a community's quality, rather than a reliable evaluation with an accuracy guarantee at runtime. (3) Attribute cohesiveness metrics often ignore the important correlation with the query node $q$. We formally define our CS-AG problem atop a $q$-centric attribute cohesiveness metric that considers both textual and numerical attributes, for the $k$-core model on homogeneous graphs. We show that the problem is NP-hard. To solve it, we first propose an exact baseline with three pruning strategies. We then propose an index-free sampling-estimation-based method that quickly returns an approximate community with an accuracy guarantee in the form of a confidence interval. The method terminates early once it reaches a result that satisfies a user-desired error bound. We extend it to heterogeneous graphs, the $k$-truss model, and size-bounded CS. Comprehensive experimental studies on ten real-world datasets show its superiority: it is at least 1.54$\times$ (41.1$\times$ on average) faster in response time, and it achieves a reliable relative error of attribute cohesiveness within a user-specified error bound.
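
The sampling-estimation idea with early termination can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the estimator, so the sketch assumes a generic normal-approximation (CLT) confidence interval over a mean. The function name `estimate_mean_with_ci`, the parameters `rel_error`, `z`, and `batch`, and the input `values` (imagined here as per-node attribute distances to $q$ within one candidate community) are all hypothetical.

```python
import math
import random
import statistics


def estimate_mean_with_ci(values, rel_error=0.05, z=1.96, batch=50, seed=0):
    """Estimate mean(values) from random samples drawn with replacement.

    Sampling stops early once the confidence-interval half-width, relative
    to the current estimate, is at most `rel_error`. Returns a tuple
    (estimate, half_width, samples_used).

    Assumption: a CLT-based interval with critical value z (1.96 ~ 95%).
    """
    if batch < 2:
        raise ValueError("batch must be at least 2 to compute a sample stdev")
    rng = random.Random(seed)
    sample = []
    mean, half = 0.0, float("inf")
    # Cap the sample size at len(values) so the loop always terminates.
    while len(sample) < len(values):
        sample.extend(rng.choice(values) for _ in range(batch))
        mean = statistics.fmean(sample)
        half = z * statistics.stdev(sample) / math.sqrt(len(sample))
        # Early termination: the user-desired relative error bound is met.
        if mean > 0 and half / mean <= rel_error:
            break
    return mean, half, len(sample)


if __name__ == "__main__":
    # Synthetic stand-in for attribute distances to the query node q.
    distances = [0.5 + (i % 10) / 10 for i in range(10_000)]
    est, half, used = estimate_mean_with_ci(distances, rel_error=0.05)
    print(f"estimate = {est:.3f} +/- {half:.3f} using {used} samples")
```

In this sketch the returned interval `estimate +/- half_width` plays the role of the accuracy guarantee, and the stopping rule plays the role of the user-desired error bound. Because only sampled values are touched, no index over the graph is required.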
