In recent years, computing the frequencies of the induced $k$-vertex subgraphs of a graph, or \emph{$k$-graphlets}, has become a central problem. One approach is to sample $k$-graphlets at random. Classic algorithms for $k$-graphlet sampling require loading the entire graph into main memory, which makes them impractical for massive graphs. To bypass this limitation, Bourreau et al. (NeurIPS 2024) introduced a \emph{streaming} algorithm that, through nontrivial techniques, makes only $O(\log n)$ passes using $O(n \log n)$ memory. In this work we break their $O(\log n)$-pass bound by giving an algorithm that, for any fixed $c>0$, makes $O(1/c)$ passes using $\tilde{O}(n^{1+c})$ memory. By their lower bound, the memory usage of our algorithm is optimal up to a factor of $\tilde{O}(n^c)$. We use this sampling algorithm to obtain an efficient method for approximating $k$-graphlet distributions. Experiments on real-world and synthetic graphs show that our algorithm always performs at least as well as that of Bourreau et al., and outperforms it by orders of magnitude on mildly dense graphs.