Digital interactions in which users share textual content published by others are naturally represented by a network where the individuals are associated with the nodes and the exchanged texts with the edges. To understand these heterogeneous and complex data structures, it is essential both to cluster the nodes into homogeneous groups and to render a comprehensible visualisation of the data. To address both issues, we introduce Deep-LPTM, a model-based clustering strategy relying on a variational graph auto-encoder approach as well as a probabilistic model to characterise the topics of discussion. Deep-LPTM builds a joint representation of the nodes and of the edges in two embedding spaces. The parameters are inferred using a variational inference algorithm. We also introduce IC2L, a model selection criterion specifically designed to choose models with relevant clustering and visualisation properties. An extensive benchmark study on synthetic data is provided. In particular, we find that Deep-LPTM recovers the partitions of the nodes better than the state-of-the-art ETSBM and STBM. Finally, the emails of the Enron company are analysed and visualisations of the results are presented, with meaningful highlights of the graph structure.
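To make the data structure concrete, the following minimal sketch shows one way to store a network whose nodes are individuals and whose edges carry exchanged texts. It is an illustration only and not part of Deep-LPTM: the toy messages, the function name `build_text_network`, the choice of directed edges, and the bag-of-words encoding of the texts are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (sender, receiver, text) triples.
messages = [
    ("alice", "bob", "meeting about gas prices"),
    ("alice", "bob", "gas contract meeting"),
    ("bob", "carol", "weekend football game"),
]


def build_text_network(messages):
    """Return node names, a binary adjacency matrix, and per-edge word counts.

    Assumptions (not taken from the abstract): edges are directed, and the
    texts exchanged on an edge are pooled into a single bag of words.
    """
    # One node per individual, in a deterministic (sorted) order.
    nodes = sorted({person for s, r, _ in messages for person in (s, r)})
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)

    # adjacency[i][j] == 1 if node i sent at least one text to node j.
    adjacency = [[0] * n for _ in range(n)]
    # edge_words[(i, j)] counts the words of all texts sent from i to j.
    edge_words = defaultdict(Counter)

    for sender, receiver, text in messages:
        i, j = index[sender], index[receiver]
        adjacency[i][j] = 1
        edge_words[(i, j)].update(text.lower().split())

    return nodes, adjacency, dict(edge_words)


nodes, adjacency, edge_words = build_text_network(messages)
```

A clustering model of the kind described above would take both objects as input: the adjacency matrix describes who interacts with whom, and the per-edge word counts describe what is being discussed.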