Fast and Space-Efficient Parallel Algorithms for Influence Maximization

Influence Maximization (IM) is a crucial problem in data science. The goal is to find a fixed-size set of highly influential seed vertices on a network that maximizes the influence spread along the edges. While IM is NP-hard on commonly used diffusion models, a greedy algorithm that repeatedly selects the vertex with the highest marginal gain in influence as the next seed achieves a $(1-1/e)$-approximation. Because of this theoretical guarantee, a rich literature focuses on improving the performance of the greedy algorithm. To estimate the marginal gain, existing work either runs Monte Carlo (MC) simulations of influence spread or pre-stores hundreds of sketches (usually per-vertex information). However, these approaches can be inefficient in time (MC simulation) or space (storing sketches), which prevents them from scaling to today's large-scale graphs. This paper significantly improves the scalability of IM using two key techniques. The first is a sketch-compression technique for the independent cascade (IC) model on undirected graphs. It allows the simulation and sketching approaches to be combined to achieve a time-space tradeoff. The second consists of new data structures for parallel seed selection. Using these new approaches, we implemented PaC-IM: Parallel and Compressed IM. We compare PaC-IM with state-of-the-art parallel IM systems on a 96-core machine with 1.5TB memory. PaC-IM can process large-scale graphs with up to 900M vertices and 74B edges in about 2 hours. On average across all tested graphs, our uncompressed version is 5--18$\times$ faster and about 1.4$\times$ more space-efficient than existing parallel IM systems. Using compression reduces space by a further 3.8$\times$ with only 70% overhead in time on average.
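The sketch-based greedy selection that the abstract builds on can be illustrated with a minimal, sequential Python sketch. It is not PaC-IM and includes neither the paper's compression nor its parallel data structures. It assumes an undirected graph under the IC model with a single uniform edge probability `p`, in which case each sketch reduces to the connected-component labels of one sampled subgraph. The names `sample_components` and `greedy_seeds` and all parameters are hypothetical, chosen only for illustration.

```python
import random


def sample_components(n, edges, p, rng):
    """Build one sketch: keep each undirected edge with probability p and
    return a component label for every vertex (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        if rng.random() < p:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
    return [find(v) for v in range(n)]


def greedy_seeds(n, edges, k, p=0.1, num_sketches=100, seed=0):
    """Greedy seed selection: pick k vertices, each time taking the vertex
    with the highest estimated marginal gain averaged over all sketches.
    Assumes k <= n. Returns (seeds, estimated influence spread)."""
    rng = random.Random(seed)
    sketches = [sample_components(n, edges, p, rng) for _ in range(num_sketches)]

    # Component sizes per sketch: size of the component reachable from a vertex.
    sizes = []
    for labels in sketches:
        count = {}
        for c in labels:
            count[c] = count.get(c, 0) + 1
        sizes.append(count)

    covered = [set() for _ in sketches]  # components already reached by seeds
    seeds, spread = [], 0.0
    for _ in range(k):
        best_v, best_gain = None, -1.0
        for v in range(n):
            if v in seeds:
                continue
            # Marginal gain: newly reached vertices, averaged over sketches.
            gain = sum(
                sizes[i][sketches[i][v]]
                for i in range(num_sketches)
                if sketches[i][v] not in covered[i]
            ) / num_sketches
            if gain > best_gain:
                best_v, best_gain = v, gain
        seeds.append(best_v)
        spread += best_gain
        for i in range(num_sketches):
            covered[i].add(sketches[i][best_v])
    return seeds, spread
```

Storing one label per vertex per sketch is the space cost the abstract refers to, while re-evaluating every vertex in each round is the time cost that faster seed-selection structures aim to reduce.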

