Network data are observed in various applications where the individual entities of the system interact with or are connected to each other, and often these interactions are defined by their associated strength or importance. Clustering is a common task in network analysis that involves finding groups of nodes displaying similarities in the way they interact with the rest of the network. However, most clustering methods use the strengths of connections between entities in their original form, ignoring the possible differences in the capacities of individual nodes to send or receive edges. This often leads to clustering solutions that are heavily influenced by the nodes' capacities. One way to overcome this is to analyse the strengths of connections in relative rather than absolute terms, expressing each edge weight as a proportion of the sending (or receiving) capacity of the respective node. This, however, induces additional modelling constraints that most existing clustering methods are not designed to handle. In this work we propose a stochastic block model for composition-weighted networks based on direct modelling of compositional weight vectors using a Dirichlet mixture, with the parameters determined by the cluster labels of the sender and the receiver nodes. Inference is implemented via an extension of the classification expectation-maximisation algorithm that uses a working independence assumption, expressing the complete data likelihood of each node of the network as a function of fixed cluster labels of the remaining nodes. A model selection criterion is derived to aid the choice of the number of clusters. The model is validated using simulation studies, and showcased on network data from the Erasmus exchange program and a bike sharing network for the city of London.
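As an illustration of the relative-weight idea described above, the following minimal sketch converts a weighted adjacency matrix into sender-side compositional weights. It is not the authors' implementation: the function name `to_compositions`, the toy matrix `W`, and the choice to ignore self-loops are all assumptions made for this example, and the Dirichlet mixture and inference steps are not shown.

```python
import numpy as np


def to_compositions(W):
    """Express each edge weight as a share of the sender's total outgoing weight.

    Hypothetical helper for illustration only; rows index senders,
    columns index receivers.
    """
    W = np.asarray(W, dtype=float).copy()
    np.fill_diagonal(W, 0.0)  # assumption: self-loops are ignored
    totals = W.sum(axis=1, keepdims=True)  # sending capacity of each node
    if np.any(totals == 0):
        raise ValueError("every node needs at least one outgoing edge")
    return W / totals  # each row now sums to 1


# Toy network: nodes 0 and 1 differ greatly in sending capacity (40 vs 800)
# but split their outgoing weight in similar proportions.
W = np.array([
    [0, 10, 30],
    [200, 0, 600],
    [5, 5, 0],
])
X = to_compositions(W)
print(X)
```

In this toy example nodes 0 and 1 both send three quarters of their total weight to node 2, so their relative profiles look alike even though their absolute weights differ by a factor of twenty. Each row of `X` lies on the simplex, which is the sum-to-one constraint that the abstract says most existing clustering methods are not designed to handle.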