Topic modeling is a widely used approach for analyzing and exploring large document collections. Recent research efforts have incorporated pre-trained contextualized language models, such as BERT embeddings, into topic modeling. However, these approaches often neglect the intrinsic informational value conveyed by mutual dependencies between words. In this study, we introduce GINopic, a topic modeling framework based on graph isomorphism networks that captures correlations between words. Through intrinsic (quantitative as well as qualitative) and extrinsic evaluations on diverse benchmark datasets, we demonstrate the effectiveness of GINopic compared to existing topic models and highlight its potential for advancing topic modeling.