Effective Clustering on Large Attributed Bipartite Graphs

Attributed bipartite graphs (ABGs) are an expressive data model for describing the interactions between two sets of heterogeneous nodes that are associated with rich attributes, such as customer-product purchase networks and author-paper authorship graphs. Partitioning the target node set in such graphs into k disjoint clusters (referred to as k-ABGC) finds widespread use in various domains, including social network analysis, recommendation systems, information retrieval, and bioinformatics. However, most existing solutions to k-ABGC either overlook attribute information or fail to capture bipartite graph structures accurately, which severely compromises result quality. These issues are more pronounced in real ABGs, which often contain millions of nodes and large volumes of attribute data, making effective k-ABGC over such graphs highly challenging. In this paper, we propose TPO, an effective and efficient approach to k-ABGC that achieves strong clustering performance on multiple real datasets. TPO obtains high clustering quality through two major contributions: (i) a novel formulation and transformation of the k-ABGC problem based on multi-scale attribute affinity, a measure designed to capture attribute affinities between nodes while accounting for their multi-hop connections in ABGs, and (ii) a highly efficient solver that includes a suite of carefully crafted optimizations for sidestepping explicit affinity matrix construction and accelerating convergence. Extensive experiments comparing TPO against 19 baselines over 5 real ABGs demonstrate the superior clustering quality of TPO measured against ground-truth labels. Moreover, compared to the state of the art, TPO is often more than 40x faster on both small and large ABGs.
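
The abstract does not say how the multi-scale attribute affinity is defined or how the solver avoids building the affinity matrix. As a generic illustration only, and not TPO's actual formulation, the sketch below shows one common way such ideas can be realized: attributes of the target nodes are blended over multi-hop walks on the bipartite graph, and products with an affinity matrix of the assumed form Z Z^T are evaluated through Z alone. The function names, the decay parameter `alpha`, the hop count `hops`, and the Z Z^T affinity are all assumptions introduced for this example.

```python
import numpy as np


def normalize_rows(m):
    """Scale each row to sum to 1; all-zero rows stay zero."""
    m = m.astype(float)
    sums = m.sum(axis=1, keepdims=True)
    return np.divide(m, sums, out=np.zeros_like(m), where=sums > 0)


def multi_hop_features(B, X, alpha=0.5, hops=3):
    """Blend target-node attributes over walks U -> V -> U (hypothetical helper).

    B is the |U| x |V| biadjacency matrix and X is the |U| x d attribute
    matrix of the target nodes. Returns
        Z = sum_{t=0..hops} (1 - alpha) * alpha**t * P**t @ X,
    where P = P_uv @ P_vu is the two-step walk matrix. P itself is never
    formed: each hop applies the two factors one after the other.
    """
    P_uv = normalize_rows(B)      # transitions from U to V
    P_vu = normalize_rows(B.T)    # transitions from V back to U
    H = X.astype(float)           # holds P**t @ X for the current t
    Z = (1.0 - alpha) * H
    for t in range(1, hops + 1):
        H = P_uv @ (P_vu @ H)
        Z += (1.0 - alpha) * alpha**t * H
    return Z


def affinity_matvec(Z, v):
    """Compute (Z @ Z.T) @ v without building the |U| x |U| matrix Z @ Z.T."""
    return Z @ (Z.T @ v)


if __name__ == "__main__":
    # Toy graph: 4 target nodes, 3 nodes on the other side, 2 attributes.
    B = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]])
    X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
    Z = multi_hop_features(B, X)
    print(affinity_matvec(Z, np.ones(4)))
```

Under these assumptions, an iterative eigensolver or k-means routine only ever needs `Z` and products such as `affinity_matvec`, so memory grows with |U| times d rather than |U| squared.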
