Counting Butterflies over Streaming Bipartite Graphs with Duplicate Edges

Bipartite graphs are commonly used to model relationships between two distinct types of entities in real-world applications, such as user-product interactions, user-movie ratings, and collaborations between authors and publications. A butterfly (a 2x2 biclique) is a critical substructure in bipartite graphs and plays a significant role in tasks such as community detection, fraud detection, and link prediction. As more real-world data arrives in streaming form, efficiently counting butterflies in streaming bipartite graphs has become increasingly important. However, most existing algorithms assume that duplicate edges are absent, an assumption that rarely holds in real-world graph streams. As a result, they are biased toward sampling edges that appear multiple times, which leads to inaccurate estimates. The only algorithm designed to handle duplicate edges is FABLE, but it suffers from significant limitations: high variance, high time complexity, and memory inefficiency caused by its reliance on a priority queue. To overcome these limitations, we introduce DEABC (Duplicate-Edge-Aware Butterfly Counting), a method that uses bucket-based priority sampling to accurately estimate the number of butterflies while accounting for duplicate edges. Compared to existing methods, DEABC significantly reduces memory usage by storing only the essential data for sampled edges, while maintaining high accuracy. We prove that the DEABC estimator is unbiased and derive bounds on its variance, providing theoretical guarantees for its accuracy. We compare DEABC with state-of-the-art algorithms on real-world streaming bipartite graphs. The results show that DEABC outperforms existing methods in memory efficiency and accuracy, while also achieving significantly higher throughput.
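To make the problem setting concrete, the following is a minimal Python sketch, not the DEABC algorithm itself. It shows (i) an exact butterfly count over the distinct edges of a small stream, which is the quantity a duplicate-aware estimator targets, and (ii) a generic hash-based bottom-k sampler in which every copy of an edge receives the same priority, so duplicates cannot change which edges are sampled. The function names (`count_butterflies`, `edge_priority`, `bottom_k_sample`), the use of SHA-256 as the priority hash, and the toy stream are all assumptions made for illustration; the bucket-based scheme used by DEABC is not reproduced here.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def count_butterflies(edge_stream):
    """Exact number of butterflies (2x2 bicliques) over the DISTINCT edges."""
    # A set keeps a single copy of each (left, right) edge, so duplicates
    # in the stream do not inflate the count.
    distinct_edges = set(edge_stream)
    neighbors = defaultdict(set)
    for left, right in distinct_edges:
        neighbors[left].add(right)
    total = 0
    # Two left vertices with c common right neighbors form C(c, 2) butterflies.
    for a, b in combinations(neighbors, 2):
        common = len(neighbors[a] & neighbors[b])
        total += common * (common - 1) // 2
    return total


def edge_priority(edge):
    """Deterministic pseudo-random priority in [0, 1) derived from the edge.

    Illustrative assumption: identical edges hash to identical priorities,
    so a duplicate is indistinguishable from its first occurrence.
    """
    digest = hashlib.sha256(repr(edge).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def bottom_k_sample(edge_stream, k):
    """Keep the k distinct edges with the smallest hash priority."""
    sample = {}  # edge -> priority
    for edge in edge_stream:
        if edge in sample:
            continue  # duplicate of an edge that is already sampled
        sample[edge] = edge_priority(edge)
        if len(sample) > k:
            # Evict the edge with the largest priority.
            del sample[max(sample, key=sample.get)]
    return set(sample)


# Toy stream: six distinct edges, two of which appear twice.
stream = [
    ("u1", "v1"), ("u1", "v2"), ("u2", "v1"), ("u2", "v2"),
    ("u1", "v1"), ("u2", "v2"),  # duplicates
    ("u3", "v1"), ("u3", "v2"),
]
deduplicated = list(dict.fromkeys(stream))

exact = count_butterflies(stream)
sample_with_duplicates = bottom_k_sample(stream, 4)
sample_without_duplicates = bottom_k_sample(deduplicated, 4)
```

In this toy stream each pair of left vertices shares both right vertices, so the exact count is 3 with or without the duplicates. The two samples are identical because an edge's priority depends only on the edge itself; a sampler that draws a fresh random priority for every arrival would instead give repeated edges more chances to be kept, which is the bias described above.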
