Digital interactions in which users share textual content published by others are naturally represented by a network whose nodes are the individuals and whose edges carry the exchanged texts. To understand these heterogeneous and complex data structures, it is essential both to cluster the nodes into homogeneous groups and to render a comprehensible visualisation of the data. To address both issues, we introduce Deep-LPTM, a model-based clustering strategy that combines a variational graph auto-encoder with a probabilistic model characterising the topics of discussion. Deep-LPTM builds a joint representation of the nodes and of the edges in two embedding spaces. The parameters are inferred using a variational inference algorithm. We also introduce IC2L, a model selection criterion specifically designed to choose models with relevant clustering and visualisation properties. An extensive benchmark study on synthetic data is provided. In particular, we find that Deep-LPTM recovers the node partitions better than the state-of-the-art ETSBM and STBM. Finally, the emails of the Enron company are analysed and visualisations of the results are presented, with meaningful highlights of the graph structure.