Timestamped relational datasets, consisting of records between pairs of entities, are ubiquitous in data and network science. For applications such as peer-to-peer communication, email, social network interactions, and computer network security, it makes sense to organize these records into groups based on how and when they occur. Weighted line graphs offer a natural way to model how records are related in such datasets, but for large real-world graph topologies the complexity of building and using the line graph is prohibitive. We present an algorithm that clusters the edges of a dynamic graph via the associated line graph without forming it explicitly. We outline a novel hierarchical dynamic graph edge clustering approach that efficiently breaks massive relational datasets into small sets of edges containing events at various timescales. This is in stark contrast to traditional graph clustering algorithms, which prioritize highly connected community structures. Our approach relies on constructing a sufficient subgraph of a weighted line graph and applying hierarchical agglomerative clustering to it; this work draws particular inspiration from HDBSCAN. We present a parallel algorithm and show that it can break billion-scale dynamic graphs into small sets of edges that are correlated in topology and time. The entire clustering process for a graph with $O(10 \text{ billion})$ edges takes just a few minutes of run time on 256 nodes of a distributed compute environment. We argue that the output of the edge clustering is useful for a multitude of data visualization and machine learning tasks, involving the original massive dynamic graph data, the non-relational metadata, or both. Finally, we demonstrate its use on a real-world large-scale directed dynamic graph and describe how it can be extended to dynamic hypergraphs and to graphs with unstructured data attached to vertices and edges.
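To make the idea of clustering edges through a line graph that is never fully formed more concrete, the following is a minimal Python sketch. It is not the algorithm of this work. It assumes, purely for illustration, that two records are related when they share an endpoint and that the line-graph weight between them is the absolute difference of their timestamps. Under that assumption, the records incident to a single vertex are points on a time axis, so linking only time-consecutive records at each vertex yields the same connected components at any threshold as the full line graph would. The names `sparse_line_graph`, `cluster_edges`, and the threshold `tau` are hypothetical.

```python
from collections import defaultdict


def sparse_line_graph(edges):
    """Return (weight, i, j) links between record indices i and j.

    edges: list of (u, v, t) records.
    Assumption (illustrative only): records sharing an endpoint are
    related, with weight |t_i - t_j|. Only time-consecutive records at
    each vertex are linked, so the full line graph is never built.
    """
    incident = defaultdict(list)
    for idx, (u, v, t) in enumerate(edges):
        incident[u].append((t, idx))
        if v != u:  # avoid listing a self-loop twice at the same vertex
            incident[v].append((t, idx))

    links = []
    for events in incident.values():
        events.sort()  # order the records at this vertex by timestamp
        for (t0, i), (t1, j) in zip(events, events[1:]):
            links.append((t1 - t0, i, j))
    return links


def cluster_edges(edges, tau):
    """Single-linkage cut: merge records joined by a link of weight <= tau."""
    parent = list(range(len(edges)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for w, i, j in sparse_line_graph(edges):
        if w <= tau:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(edges))]
```

Calling `cluster_edges` with increasing values of `tau` gives progressively coarser groupings, which is one simple way to read off a hierarchy over timescales. A density-aware method in the spirit of HDBSCAN would replace the fixed threshold with a more careful selection of clusters from that hierarchy.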