Approximate Graph Pattern Mining (AGPM) is essential for analyzing large-scale graphs where exact counting is computationally prohibitive. Although numerous sampling-based AGPM systems exist, they all rely on uniform sampling and overlook the underlying probability distribution. This limitation prevents them from scaling to a broader range of patterns. In this paper, we introduce AGIS, an extremely fast AGPM system capable of counting arbitrary patterns in huge graphs. AGIS employs structure-informed neighbor sampling, a novel sampling technique that departs from uniform sampling and instead assigns sampling probabilities based on the pattern structure. We first derive the ideal sampling distribution for AGPM and then present a practical method to approximate it. We also develop a method that balances convergence speed against computational overhead by determining when to use the approximated distribution. Experimental results demonstrate that AGIS significantly outperforms the state-of-the-art AGPM system, achieving a geometric mean speedup of 28.5x and speedups of more than 100,000x in specific cases. Moreover, AGIS is the only AGPM system that scales to graphs with tens of billions of edges and robustly handles diverse patterns, providing accurate estimates within seconds. We will open-source AGIS to encourage further research in this field.
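To make the idea of non-uniform neighbor sampling concrete, the following is a minimal Python sketch of an importance-weighted neighbor sampling estimator for triangle counts. It is not the AGIS algorithm and does not reproduce its sampling distribution; the function names (`build_adjacency`, `estimate_triangles`), the toy graph, and the degree-based weighting are all assumptions chosen for illustration. The sketch only shows why a sampler may leave the uniform distribution and remain unbiased: each sample is reweighted by the inverse of the probability with which it was drawn.

```python
import random


def build_adjacency(edges):
    """Build an undirected adjacency map {vertex: set of neighbors}."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def estimate_triangles(adj, num_samples, weight_fn, rng):
    """Estimate the triangle count with importance-weighted neighbor sampling.

    adj: adjacency map of a simple undirected graph (no self-loops).
    weight_fn(u, v, w): positive, unnormalized weight for extending the
        sampled edge (u, v) to neighbor w of v (hypothetical interface).
    """
    # Every undirected edge appears twice, once per direction.
    directed_edges = [(u, v) for u in adj for v in adj[u]]
    total = 0.0
    for _ in range(num_samples):
        # Step 1: pick a directed edge (u, v) uniformly.
        u, v = rng.choice(directed_edges)
        # Step 2: pick a neighbor w of v from a possibly non-uniform proposal.
        candidates = sorted(adj[v])
        weights = [weight_fn(u, v, w) for w in candidates]
        w = rng.choices(candidates, weights=weights, k=1)[0]
        # Step 3: if w closes the triangle, add the inverse sampling probability.
        if w in adj[u]:
            q = weight_fn(u, v, w) / sum(weights)
            total += len(directed_edges) / q
    # Each triangle is reachable through 6 ordered (u, v, w) tuples.
    return total / (6 * num_samples)


# Toy graph: a 4-clique on {0, 1, 2, 3} plus a pendant path 3-4-5.
# The 4-clique contains exactly 4 triangles; the path adds none.
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
ADJ = build_adjacency(EDGES)


def uniform_weight(u, v, w):
    """Uniform neighbor sampling: every neighbor is equally likely."""
    return 1.0


def degree_weight(u, v, w):
    """Illustrative non-uniform proposal: favor high-degree neighbors."""
    return float(len(ADJ[w]))
```

Calling `estimate_triangles(ADJ, 40000, uniform_weight, random.Random(0))` and the same call with `degree_weight` should both return values close to 4, because the reweighting keeps the estimator unbiased under either proposal. What changes between proposals is the variance, and therefore the number of samples needed for a given accuracy, which is the quantity a structure-informed distribution aims to reduce.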