Topic modeling is a widely used approach for analyzing and exploring large document collections. Recent work has incorporated contextualized representations from pre-trained language models, such as BERT embeddings, into topic models. However, these approaches often neglect the information conveyed by mutual dependencies between words. In this study, we introduce GINopic, a topic modeling framework based on graph isomorphism networks that captures correlations between words. Through intrinsic (both quantitative and qualitative) and extrinsic evaluations on diverse benchmark datasets, we demonstrate the effectiveness of GINopic compared to existing topic models and highlight its potential for advancing topic modeling.