Community detection in networks is a fundamental problem in machine learning and statistical inference, with applications in social networks, biological systems, and communication networks. The stochastic block model (SBM) serves as a canonical framework for studying community structure, and exact recovery, identifying the true communities with high probability, is a central theoretical question. While classical results characterize the phase transition for exact recovery based solely on graph connectivity, many real-world networks contain additional data, such as node attributes or labels. In this work, we study exact recovery in the Data Block Model (DBM), an SBM augmented with node-associated data, as formalized by Asadi, Abbe, and Verdú (2017). We introduce the Chernoff--TV divergence and use it to characterize a sharp exact recovery threshold for the DBM. We further provide an efficient algorithm that achieves this threshold, along with a matching converse result showing impossibility below the threshold. Finally, simulations validate our findings and demonstrate the benefits of incorporating vertex data as side information in community detection.