Most existing topic models rely on bag-of-words (BOW) representations, which prevents them from capturing word-order information and leaves them unable to handle out-of-vocabulary (OOV) words in new documents. Contextualized word embeddings, by contrast, excel at word sense disambiguation and effectively address the OOV issue. In this work, we introduce a novel neural topic model, the Contextualized Word Topic Model (CWTM), which integrates contextualized word embeddings from BERT. The model learns the topic vector of a document without BOW information, and it can also derive topic vectors for the individual words within a document from their contextualized word embeddings. Experiments across various datasets show that CWTM produces more coherent and meaningful topics than existing topic models, while also accommodating unseen words in newly encountered documents.
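As a rough illustration of the input such a model consumes, the sketch below extracts per-token contextualized word embeddings from BERT with the Hugging Face `transformers` library. This is not the paper's code: the model checkpoint (`bert-base-uncased`) and the use of the final hidden layer are assumptions made here for demonstration.

```python
# Sketch: contextualized word embeddings from BERT (assumed setup,
# not CWTM's actual implementation).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

doc = "The bank raised interest rates."
inputs = tokenizer(doc, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional contextualized vector per WordPiece token,
# including the special [CLS] and [SEP] tokens. Unlike a BOW count
# vector, these embeddings depend on each token's surrounding context.
token_embeddings = outputs.last_hidden_state.squeeze(0)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(len(tokens), tuple(token_embeddings.shape))
```

Because the same word type receives a different vector in each context (e.g. "bank" in financial vs. geographic uses), a topic model built on these vectors can assign word-level topics and embed words never seen during training, which is the property the abstract highlights.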