Communication networks such as email or social networks are now ubiquitous, and their analysis has become a strategic field. In many applications, the goal is to automatically extract relevant information by looking at the nodes and their connections. Unfortunately, most existing methods focus on the presence or absence of edges, and textual data is often discarded. However, all communication networks come with textual data on their edges. To account for this, we consider networks in which two nodes are linked if and only if they share textual data. We introduce ETSBM, a deep latent variable model with embedded topics that simultaneously clusters the nodes and models the topics used between the different clusters. ETSBM extends both the stochastic block model (SBM) and the embedded topic model (ETM), which are core models for studying networks and corpora, respectively. Inference is performed with a variational-Bayes expectation-maximisation algorithm combined with stochastic gradient descent. The methodology is evaluated on synthetic data and on a real-world dataset.
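To make the combination of SBM and ETM more concrete, the following is a minimal generative sketch, not the ETSBM specification itself. It assumes the standard ingredients of the two parent models: SBM-style cluster memberships with a connection probability per pair of clusters, and ETM-style topics whose word distributions are a softmax over inner products of word and topic embeddings. The function name `sample_network`, all parameter values, the directed edges, and the choice of one topic-proportion vector per cluster pair are illustrative assumptions.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_network(n_nodes=6, n_clusters=2, n_topics=3, vocab_size=8,
                   embed_dim=4, words_per_edge=5, seed=0):
    """Sample clusters and text-carrying edges (illustrative assumptions only)."""
    rng = random.Random(seed)

    # SBM part (assumed values): uniform cluster prior and a connection
    # probability for each ordered pair of clusters.
    cluster_probs = [1.0 / n_clusters] * n_clusters
    pi = [[0.8 if q == r else 0.2 for r in range(n_clusters)]
          for q in range(n_clusters)]

    # ETM part: word embeddings rho and topic embeddings alpha; the word
    # distribution of topic k is softmax over <rho_v, alpha_k>.
    rho = [[rng.gauss(0, 1) for _ in range(embed_dim)]
           for _ in range(vocab_size)]
    alpha = [[rng.gauss(0, 1) for _ in range(embed_dim)]
             for _ in range(n_topics)]
    beta = [softmax([sum(r * a for r, a in zip(rho[v], alpha[k]))
                     for v in range(vocab_size)])
            for k in range(n_topics)]

    # Assumption: one logistic-normal topic-proportion vector per cluster pair.
    theta = [[softmax([rng.gauss(0, 1) for _ in range(n_topics)])
              for _ in range(n_clusters)]
             for _ in range(n_clusters)]

    clusters = rng.choices(range(n_clusters), weights=cluster_probs, k=n_nodes)

    # An edge exists if and only if it carries text, so every sampled edge
    # gets a bag of word indices and non-edges get nothing.
    edges = {}
    for i in range(n_nodes):
        for j in range(n_nodes):
            if i == j:
                continue
            q, r = clusters[i], clusters[j]
            if rng.random() < pi[q][r]:
                words = []
                for _ in range(words_per_edge):
                    k = rng.choices(range(n_topics), weights=theta[q][r])[0]
                    words.append(
                        rng.choices(range(vocab_size), weights=beta[k])[0])
                edges[(i, j)] = words
    return clusters, edges
```

Calling `sample_network(seed=0)` returns a list of cluster labels and a dictionary mapping each ordered node pair that is linked to the word indices exchanged on that edge. Inference would go in the opposite direction, recovering the clusters and topics from the observed edges and text.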