Finding dense subgraphs is a fundamental algorithmic tool in data mining, community detection, and clustering. In this problem, one aims to find an induced subgraph whose edge-to-vertex ratio is maximized. We study the directed case of this question in the context of semi-streaming and massively parallel algorithms. In particular, we show that it is possible to find a $(2+\epsilon)$-approximation on randomly ordered streams, even in a single pass, using $O(n \cdot {\rm poly} \log n)$ memory on $n$-vertex graphs. Our result improves over prior works, which were designed for arbitrarily ordered streams: the algorithm by Bahmani et al. (VLDB 2012), which uses $O(\log n)$ passes, and the work by Esfandiari et al. (2015), which makes one pass but uses $O(n^{3/2})$ memory. Moreover, our techniques extend to the Massively Parallel Computation model, yielding $O(1)$ rounds in the super-linear and $O(\sqrt{\log n})$ rounds in the nearly-linear memory regime. This constitutes a quadratic improvement over the state-of-the-art bounds by Bahmani et al. (VLDB 2012 and WAW 2014), which require $O(\log n)$ rounds even in the super-linear memory regime. Finally, we empirically evaluate our single-pass semi-streaming algorithm on $6$ benchmarks and show that, even on non-randomly ordered streams, the quality of its output is essentially the same as that of Bahmani et al. (VLDB 2012), while it is $2$ times faster on large graphs.
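To make the objective concrete, the following minimal sketch computes the edge-to-vertex ratio of an induced subgraph and finds its maximizer by exhaustive search on a toy graph. It is only an illustration of the quantity being optimized, not the streaming or parallel algorithm summarized above. It uses the plain ratio $|E(S)|/|S|$ as worded in the abstract (the directed variant studied in the paper may normalize differently), and the function names and the example graph are hypothetical. A $(2+\epsilon)$-approximation would be any vertex set whose density is at least the optimum divided by $2+\epsilon$.

```python
from itertools import combinations


def density(edges, vertices):
    """Edge-to-vertex ratio of the subgraph induced by `vertices`."""
    vs = set(vertices)
    if not vs:
        return 0.0
    # Count only edges with both endpoints inside the chosen vertex set.
    induced = sum(1 for u, v in edges if u in vs and v in vs)
    return induced / len(vs)


def densest_subgraph_bruteforce(edges):
    """Try every non-empty vertex subset (exponential time; tiny graphs only)."""
    nodes = sorted({x for e in edges for x in e})
    best_set, best_density = (), 0.0
    for k in range(1, len(nodes) + 1):
        for subset in combinations(nodes, k):
            d = density(edges, subset)
            if d > best_density:
                best_set, best_density = subset, d
    return set(best_set), best_density


# Hypothetical toy graph: a 4-clique on {0, 1, 2, 3} plus a path 3-4-5.
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]

if __name__ == "__main__":
    print(densest_subgraph_bruteforce(EDGES))
```

On this toy graph the 4-clique has 6 edges on 4 vertices (density 1.5), whereas the whole graph has 8 edges on 6 vertices (density about 1.33), so the exhaustive search selects the clique.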