With the development of neural topic models in recent years, topic modelling plays an increasingly important role in natural language understanding. However, most existing topic models still rely on bag-of-words (BoW) information, either as the training input or as the training target. This limits their ability to capture word order in documents and exposes them to the out-of-vocabulary (OOV) problem: they cannot handle words in new documents that were not observed during training. Contextualized word embeddings from pre-trained language models excel at word sense disambiguation and have proved effective at handling OOV words. In this work, we develop a novel neural topic model that incorporates contextualized word embeddings from the pre-trained language model BERT. The model can infer the topic distribution of a document without using any BoW information. In addition, it can infer the topic distribution of each word in a document directly from the contextualized word embeddings. Experiments on several datasets show that our model outperforms existing topic models on both document classification and topic coherence metrics and can accommodate unseen words in newly arrived documents. Experiments on a named entity recognition (NER) dataset also show that our model produces high-quality word topic representations.
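The two BoW limitations named above (loss of word order and the OOV problem) can be made concrete with a small sketch. This is not the model proposed in this work; it is a toy bag-of-words encoder written only for illustration, and the helper names `build_vocab` and `bow_vector` as well as the example sentences are assumptions introduced here.

```python
from collections import Counter


def build_vocab(docs):
    """Map each word seen in the training documents to a column index."""
    words = sorted({w for doc in docs for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}


def bow_vector(doc, vocab):
    """Count in-vocabulary words; also return the words that had to be dropped."""
    counts = Counter(doc.lower().split())
    vec = [0] * len(vocab)
    dropped = []
    for word, n in counts.items():
        if word in vocab:
            vec[vocab[word]] = n
        else:
            # A fixed vocabulary has no column for an unseen word.
            dropped.append(word)
    return vec, sorted(dropped)


# Hypothetical toy corpus, used only to build the vocabulary.
train_docs = ["the dog bit the man", "the man fed the dog"]
vocab = build_vocab(train_docs)

# Limitation 1: two sentences with different meanings get the same vector.
vec_a, _ = bow_vector("the dog bit the man", vocab)
vec_b, _ = bow_vector("the man bit the dog", vocab)
same_vector = vec_a == vec_b

# Limitation 2: words unseen at training time are silently discarded.
_, dropped = bow_vector("the veterinarian treated the dog", vocab)

print("identical BoW vectors:", same_vector)
print("dropped OOV words:", dropped)
```

In this toy setting the two reordered sentences are indistinguishable, and the new document loses every word outside the training vocabulary. A model that reads contextualized subword embeddings instead of word counts would not be bound by a fixed word-level vocabulary in this way.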