Pre-trained language models have led to a new state of the art in many NLP tasks. However, topic modeling is still dominated by statistical generative models such as LDA, which do not easily incorporate contextual word vectors and may yield topics that align poorly with human judgment. In this work, we propose a novel topic modeling and inference algorithm. We suggest a bag of sentences (BoS) approach that uses sentences as the unit of analysis. We leverage pre-trained sentence embeddings by combining generative process models with clustering. We derive a fast inference algorithm based on expectation maximization, hard assignments, and an annealing process. Our evaluation shows that our method yields state-of-the-art results with relatively low computational demands. Our method is more flexible than prior works leveraging word embeddings, since it allows topic-document distributions to be customized using priors. Code is at \url{https://github.com/JohnTailor/BertSenClu}.
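To make the combination of expectation maximization, hard assignments, and annealing more concrete, the following is a minimal sketch of how such an inference loop over sentence embeddings could look. It is not the implementation from the linked repository: the function name `fit_bos_topics`, the scoring rule (non-negative cosine similarity multiplied by a smoothed topic-document probability), and the parameters `alpha0` and `decay` are assumptions made for illustration. The sketch also assumes that every document contains at least one sentence and that the sentence embeddings have already been computed by some pre-trained encoder.

```python
import numpy as np


def normalize_rows(x):
    """Scale each row to unit length, guarding against zero vectors."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def fit_bos_topics(doc_embeddings, n_topics, n_iters=20,
                   alpha0=1.0, decay=0.7, seed=0):
    """Illustrative hard-assignment EM over sentence embeddings.

    doc_embeddings: list of arrays, one per document, each of shape
                    (n_sentences_in_doc, dim) with at least one row.
    alpha0, decay:  hypothetical annealing schedule; alpha smooths the
                    topic-document distribution and shrinks each iteration.
    """
    rng = np.random.default_rng(seed)
    n_docs = len(doc_embeddings)
    sents = normalize_rows(np.vstack(doc_embeddings))
    # Map every sentence back to the document it came from.
    doc_ids = np.repeat(np.arange(n_docs), [len(d) for d in doc_embeddings])

    # Initialise topic vectors from randomly chosen sentences.
    start = rng.choice(len(sents), size=n_topics, replace=False)
    topics = sents[start].copy()
    p_t_d = np.full((n_docs, n_topics), 1.0 / n_topics)
    alpha = alpha0

    for _ in range(n_iters):
        # E-step with hard assignments: each sentence picks one topic.
        sim = np.clip(sents @ topics.T, 0.0, None)
        score = sim * (p_t_d[doc_ids] + alpha)
        assign = score.argmax(axis=1)

        # M-step: topic vector = mean of its assigned sentences.
        for t in range(n_topics):
            members = sents[assign == t]
            if len(members) > 0:
                topics[t] = members.mean(axis=0)
        topics = normalize_rows(topics)

        # Topic-document distribution from per-document assignment counts.
        counts = np.zeros((n_docs, n_topics))
        np.add.at(counts, (doc_ids, assign), 1.0)
        p_t_d = counts / counts.sum(axis=1, keepdims=True)

        # Annealing: reduce smoothing so documents focus on fewer topics.
        alpha *= decay

    return topics, p_t_d, assign
```

With a large `alpha`, early iterations behave much like plain clustering of sentences by similarity; as `alpha` decays, the per-document topic distribution increasingly influences the assignments. In this sketch, a prior over topic-document distributions could be introduced by adding pseudo-counts to `counts` before normalizing, which is again an assumption rather than the authors' exact formulation.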