Approximate graph pattern mining (A-GPM) is an important data analysis tool for many graph-based applications. Existing sampling-based A-GPM systems provide automation and generalization across a wide variety of use cases. However, two major obstacles prevent these systems from being adopted in practice. First, the termination mechanism that decides when to stop sampling lacks theoretical backing for its confidence, and is unstable and slow in practice. Second, these systems perform poorly on "needle-in-the-hay" cases, where the extremely low hit rate of their fixed sampling schemes means that a huge number of samples is required to converge. We build ScaleGPM, an accurate and fast A-GPM system that removes both obstacles. First, we propose a novel on-the-fly convergence detection mechanism that achieves stable termination and provides a theoretical guarantee on the confidence, with negligible overhead. Second, we propose two techniques to deal with the "needle-in-the-hay" problem: eager-verify and hybrid sampling. Eager-verify improves the sampling hit rate by pruning unpromising candidates as early as possible. Hybrid sampling improves performance by automatically choosing the better of the fine-grained and coarse-grained sampling schemes. Experiments show that our online convergence detection mechanism yields stable and rapid termination with theoretically guaranteed confidence. We also show that eager-verify is effective in improving the hit rate, and that the scheme-selection mechanism correctly chooses the better scheme in various cases. Overall, ScaleGPM achieves a geometric-mean speedup of 565x (up to 610169x) over the state-of-the-art A-GPM system, Arya. In particular, ScaleGPM handles billion-scale graphs in seconds, where existing systems either run out of memory or fail to complete within hours.
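To make the idea of on-the-fly termination concrete, the following is a minimal sketch of a sampling-based pattern counter that checks a stopping rule after every sample. It is not ScaleGPM's mechanism: the stopping rule here is a generic normal-approximation confidence interval, the sampling scheme is naive uniform sampling of vertex triples, and all names (`estimate_triangles`, `rel_err`, `min_samples`, `max_samples`) are hypothetical. It also illustrates why low hit rates hurt: when few samples match the pattern, the relative interval width shrinks slowly and many more samples are needed.

```python
import math
import random


def complete_graph(n):
    """Adjacency sets of the complete graph on vertices 0..n-1."""
    return {u: {v for v in range(n) if v != u} for u in range(n)}


def estimate_triangles(adj, rel_err=0.05, z=1.96, min_samples=1000,
                       max_samples=1_000_000, seed=0):
    """Estimate the triangle count of an undirected graph by sampling.

    Assumed scheme (illustrative only): draw three distinct vertices
    uniformly at random and record a hit if they form a triangle.
    Sampling stops once the normal-approximation confidence interval
    half-width is at most rel_err times the running mean, or when
    max_samples is reached. Requires at least three vertices.
    Returns (estimate, number_of_samples_used).
    """
    rng = random.Random(seed)
    nodes = list(adj)
    scale = math.comb(len(nodes), 3)  # number of vertex triples

    count, mean, m2 = 0, 0.0, 0.0  # Welford running statistics
    while count < max_samples:
        a, b, c = rng.sample(nodes, 3)
        hit = 1.0 if (b in adj[a] and c in adj[a] and c in adj[b]) else 0.0

        # Update running mean and sum of squared deviations in O(1).
        count += 1
        delta = hit - mean
        mean += delta / count
        m2 += delta * (hit - mean)

        # Online convergence check; skipped while no hit has been seen,
        # since a relative error bound is undefined for a zero mean.
        if count >= min_samples and mean > 0:
            std_err = math.sqrt(m2 / (count - 1) / count)
            if z * std_err <= rel_err * mean:
                break

    return scale * mean, count


if __name__ == "__main__":
    estimate, used = estimate_triangles(complete_graph(6))
    print(f"estimate={estimate}, samples={used}")
```

The check costs a constant number of arithmetic operations per sample, which is what makes per-sample testing cheap. A graph with no matching pattern never satisfies the relative-error test in this sketch and runs until `max_samples`, which is the degenerate form of the low-hit-rate case described above.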