Digital interactions in which users share textual content published by others are naturally represented by a network where the individuals are associated with the nodes and the exchanged texts with the edges. To understand these heterogeneous and complex data structures, it is necessary both to cluster the nodes into homogeneous groups and to render a comprehensible visualisation of the data. To address both issues, we introduce Deep-LPTM, a model-based clustering strategy relying on a variational graph auto-encoder approach as well as a probabilistic model to characterise the topics of discussion. Deep-LPTM builds a joint representation of the nodes and of the edges in two embedding spaces. The parameters are inferred using a variational inference algorithm. We also introduce IC2L, a model selection criterion specifically designed to choose models with relevant clustering and visualisation properties. An extensive benchmark study on synthetic data is provided. In particular, we find that Deep-LPTM recovers the partitions of the nodes better than the state-of-the-art ETSBM and STBM. Finally, the emails of the Enron company are analysed and visualisations of the results are presented, with meaningful highlights of the graph structure.