We propose a new model to address the overlooked problem of node clustering in simple hypergraphs. Simple hypergraphs are appropriate when a node cannot appear more than once in the same hyperedge, as in co-authorship datasets. Our model assumes that latent node groups exist and that hyperedges are conditionally independent given these groups. We first establish the generic identifiability of the model parameters. We then develop a variational Expectation-Maximization algorithm for parameter inference and node clustering, and derive a statistical criterion for model selection. To illustrate the performance of our R package HyperSBM, we compare it with other node clustering methods on synthetic data generated from the model, on data from a line clustering experiment, and on a co-authorship dataset. As a by-product, our synthetic experiments demonstrate that the detectability thresholds for non-uniform sparse hypergraphs cannot be deduced from the uniform case.
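To make the generative assumption concrete, the following is a minimal Python sketch of sampling a simple hypergraph given latent node groups. It is an illustration only, not the HyperSBM package or its API: the function name `sample_hypergraph`, the parameters `pi`, `p_in`, `p_out` and `max_size`, and the simplification that a hyperedge's probability depends only on whether all of its nodes share one group are assumptions made for this example.

```python
import itertools
import random


def sample_hypergraph(n, pi, p_in, p_out, max_size, seed=0):
    """Sample latent groups, then a simple hypergraph given those groups.

    Assumed (hypothetical) parametrization: a candidate hyperedge of size m
    is present with probability p_in[m] when all of its nodes belong to the
    same group, and with probability p_out[m] otherwise.
    """
    rng = random.Random(seed)
    # Latent group of each node, drawn independently with proportions pi.
    groups = rng.choices(range(len(pi)), weights=pi, k=n)
    edges = []
    for m in range(2, max_size + 1):
        # combinations() only yields sets of distinct nodes, so no node can
        # appear twice in a hyperedge: the hypergraph is simple.
        for nodes in itertools.combinations(range(n), m):
            same_group = len({groups[v] for v in nodes}) == 1
            prob = p_in[m] if same_group else p_out[m]
            # Given the groups, each hyperedge is drawn independently.
            if rng.random() < prob:
                edges.append(nodes)
    return groups, edges


groups, edges = sample_hypergraph(
    n=12,
    pi=[0.5, 0.5],
    p_in={2: 0.3, 3: 0.1},
    p_out={2: 0.05, 3: 0.01},
    max_size=3,
)
```

Enumerating every candidate subset is only practical for small `n` and `max_size`. The sketch is meant to show the conditional independence structure, not to scale.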