Topic models aim to reveal latent structure within a corpus of text, typically via term-frequency statistics over bag-of-words representations of documents. In recent years, conceptual entities (interpretable, language-independent features linked to external knowledge resources) have been used in place of word-level tokens, since words typically require extensive language processing while offering little assurance of interpretability. However, the current literature offers only limited exploration of purely entity-driven neural topic modeling. In particular, despite the advantages of entities for eliciting thematic structure, it is unclear whether existing techniques are compatible with these sparsely organized, information-dense conceptual units. In this work, we explore entity-based neural topic modeling and propose a novel topic clustering approach built on bimodal vector representations of entities. Concretely, we extract these latent representations from large language models and from graph neural networks trained on a knowledge base of symbolic relations, in order to capture the most salient aspects of these conceptual units. Analysis of coherence metrics confirms that our approach is better suited to working with entities than state-of-the-art models, particularly when using graph-based embeddings trained on a knowledge base.
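To make the high-level idea concrete, the following is a minimal sketch of how bimodal entity representations could be fused and clustered into topics. It is not the paper's actual pipeline: the entity list, embedding dimensions, the concatenation-based fusion, and the use of KMeans are all illustrative assumptions; in practice the two views would come from a language model and from a graph neural network trained on a knowledge base.

```python
# Minimal sketch: cluster bimodal entity representations into topics.
# Assumes two precomputed embedding matrices (one row per entity):
#   lm_emb    -- stand-in for contextual entity embeddings from a language model
#   graph_emb -- stand-in for embeddings from a GNN trained on a knowledge base
# All names and sizes below are illustrative, not taken from the paper.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

entities = ["Paris", "France", "Seine", "Python", "NumPy"]   # toy entity vocabulary
rng = np.random.default_rng(0)
lm_emb = rng.normal(size=(len(entities), 768))               # placeholder LM vectors
graph_emb = rng.normal(size=(len(entities), 200))            # placeholder KG vectors

# Fuse the two modalities: L2-normalize each view, then concatenate so that
# neither modality dominates purely because of its scale or dimensionality.
bimodal = np.hstack([normalize(lm_emb), normalize(graph_emb)])

# Cluster the entities; each cluster is read as one topic.
n_topics = 2
km = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit(bimodal)

# Describe each topic by the entities closest to its centroid.
for k in range(n_topics):
    idx = np.where(km.labels_ == k)[0]
    dists = np.linalg.norm(bimodal[idx] - km.cluster_centers_[k], axis=1)
    top = [entities[i] for i in idx[np.argsort(dists)][:3]]
    print(f"topic {k}: {top}")
```

With real embeddings in place of the random placeholders, the per-cluster entity lists printed at the end would serve as the topic descriptors whose coherence is evaluated.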