The growing prevalence of network data across many fields, and the need to extract useful information from them, have spurred rapid development of related models and algorithms. Among the various learning tasks with network data, community detection, the discovery of node clusters or "communities," has arguably received the most attention in the scientific community. In many real-world applications, network data come with additional information in the form of node or edge covariates that should ideally be leveraged for inference. In this paper, we add to the limited literature on community detection for networks with covariates by proposing a Bayesian stochastic block model with a covariate-dependent random partition prior. Under our prior, the covariates enter explicitly into the prior distribution on cluster membership. Our model quantifies the uncertainty of all parameter estimates, including community membership. Importantly, and unlike the majority of existing methods, our model learns the number of communities via posterior inference rather than assuming it to be known. Our model can be applied to community detection in both dense and sparse networks, with both categorical and continuous covariates, and our MCMC algorithm is efficient and mixes well. We demonstrate the superior performance of our model over existing models in a comprehensive simulation study and in applications to two real datasets.
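For readers unfamiliar with the stochastic block model, the following is a minimal sketch of the standard (covariate-free) model only: given community labels `z` and a block probability matrix `B`, each edge is an independent Bernoulli draw with probability `B[z[i], z[j]]`. The function names, the three-community setup, and the probabilities 0.40 and 0.05 are illustrative assumptions, not taken from the paper, and the covariate-dependent prior on `z` is not shown because its form is not specified here.

```python
import numpy as np

def sample_sbm(z, B, rng):
    """Sample a symmetric, loop-free adjacency matrix from a standard SBM."""
    n = len(z)
    P = B[np.ix_(z, z)]                      # P[i, j] = B[z[i], z[j]]
    upper = np.triu(rng.random((n, n)) < P, k=1)  # draw each pair i < j once
    return (upper | upper.T).astype(int)     # mirror to get an undirected graph

def sbm_log_likelihood(A, z, B):
    """Bernoulli log-likelihood of A given labels z and block matrix B."""
    P = B[np.ix_(z, z)]
    iu = np.triu_indices(len(z), k=1)        # count each unordered pair once
    a, p = A[iu], P[iu]
    return float(np.sum(a * np.log(p) + (1 - a) * np.log1p(-p)))

# Hypothetical example: 3 communities of 20 nodes each.
rng = np.random.default_rng(0)
z = np.repeat([0, 1, 2], 20)
B = np.full((3, 3), 0.05) + np.eye(3) * 0.35  # 0.40 within, 0.05 between
A = sample_sbm(z, B, rng)

ll_true = sbm_log_likelihood(A, z, B)
ll_shuffled = sbm_log_likelihood(A, rng.permutation(z), B)
```

In this sketch, the likelihood evaluated at the generating labels should exceed the likelihood at randomly permuted labels, which is the signal a community detection method exploits. A Bayesian treatment would additionally place priors on `z` and `B` and sample from the posterior.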