Faster Streaming and Scalable Algorithms for Finding Directed Dense Subgraphs in Large Graphs

Finding dense subgraphs is a fundamental algorithmic tool in data mining, community detection, and clustering. In this problem, one aims to find an induced subgraph whose edge-to-vertex ratio is maximized. We study the directed case of this question in the context of semi-streaming and massively parallel algorithms. In particular, we show that it is possible to find a $(2+\epsilon)$ approximation on randomized streams even in a single pass by using $O(n \cdot {\rm poly} \log n)$ memory on $n$-vertex graphs. Our result improves over prior works, which were designed for arbitrary-ordered streams: the algorithm by Bahmani et al. (VLDB 2012) which uses $O(\log n)$ passes, and the work by Esfandiari et al. (2015) which makes one pass but uses $O(n^{3/2})$ memory. Moreover, our techniques extend to the Massively Parallel Computation model yielding $O(1)$ rounds in the super-linear and $O(\sqrt{\log n})$ rounds in the nearly-linear memory regime. This constitutes a quadratic improvement over state-of-the-art bounds by Bahmani et al. (VLDB 2012 and WAW 2014), which require $O(\log n)$ rounds even in the super-linear memory regime. Finally, we empirically evaluate our single-pass semi-streaming algorithm on $6$ benchmarks and show that, even on non-randomly ordered streams, the quality of its output is essentially the same as that of Bahmani et al. (VLDB 2012) while it is $2$ times faster on large graphs.
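
As a small illustration of the objective, the sketch below computes the edge-to-vertex ratio of an induced subgraph and a directed analogue. The abstract does not spell out the directed density; the formula used here, $|E(S,T)| / \sqrt{|S| \cdot |T|}$ for a source set $S$ and a target set $T$, is the definition commonly used in the directed densest subgraph literature and is an assumption on our part, as are the function names and the toy graph. This is not the streaming algorithm from the paper, only the quantity that such an algorithm approximates.

```python
from math import sqrt


def undirected_density(edges, nodes):
    """Edge-to-vertex ratio of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    if not nodes:
        return 0.0
    # Count edges with both endpoints inside the chosen vertex set.
    inside = sum(1 for u, v in edges if u in nodes and v in nodes)
    return inside / len(nodes)


def directed_density(edges, sources, targets):
    """Assumed directed density: |E(S, T)| / sqrt(|S| * |T|)."""
    sources, targets = set(sources), set(targets)
    if not sources or not targets:
        return 0.0
    # Count directed edges that leave `sources` and enter `targets`.
    crossing = sum(1 for u, v in edges if u in sources and v in targets)
    return crossing / sqrt(len(sources) * len(targets))


# Hypothetical toy graph: vertices 0 and 1 each point to 2 and 3; 4 points to 0.
edges = [(0, 2), (0, 3), (1, 2), (1, 3), (4, 0)]
all_nodes = {0, 1, 2, 3, 4}

whole_graph = directed_density(edges, all_nodes, all_nodes)
dense_pair = directed_density(edges, {0, 1}, {2, 3})
```

On this toy graph the pair $S = \{0, 1\}$, $T = \{2, 3\}$ is denser than the whole graph, which is the kind of structure a $(2+\epsilon)$-approximation is meant to recover up to the stated factor.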

