Pre-trained language models have led to a new state of the art in many NLP tasks. For topic modeling, however, statistical generative models such as LDA remain prevalent. These models do not easily allow the incorporation of contextual word vectors, and they may yield topics that align poorly with human judgment. In this work, we propose a novel topic modeling and inference algorithm. We suggest a bag of sentences (BoS) approach that uses sentences as the unit of analysis. We leverage pre-trained sentence embeddings by combining generative process models with clustering. We derive a fast inference algorithm based on expectation maximization, hard assignments, and an annealing process. Our evaluation shows that our method yields state-of-the-art results with relatively low computational demands. Our method is more flexible than prior works leveraging word embeddings, since it allows topic-document distributions to be customized using priors. Code is available at \url{https://github.com/JohnTailor/BertSenClu}.
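To make the combination of expectation maximization, hard assignments, and annealing more concrete, the following is a minimal sketch of how such an inference loop could look. It is not the implementation from the linked repository. The function name `bos_hard_em`, the cosine-similarity score, the temperature `temp`, and the geometric decay of the smoothing prior `alpha` are all assumptions made for illustration, and the sentence embeddings are taken as a given array rather than computed by a pre-trained encoder.

```python
import numpy as np


def bos_hard_em(emb, doc_ids, n_topics, n_iter=20, alpha0=8.0,
                alpha_min=0.5, decay=0.7, temp=0.1, seed=0):
    """Illustrative hard-assignment EM over sentence embeddings.

    emb:     (n_sentences, dim) array of nonzero sentence embeddings
    doc_ids: (n_sentences,) integer document index of each sentence
    All hyperparameters are hypothetical defaults, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    emb = np.asarray(emb, dtype=float)
    doc_ids = np.asarray(doc_ids, dtype=int)
    # Unit-normalize so that dot products are cosine similarities.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    n_docs = int(doc_ids.max()) + 1

    def doc_topic(assign, alpha):
        # Smoothed topic distribution per document from hard assignments.
        counts = np.zeros((n_docs, n_topics))
        np.add.at(counts, (doc_ids, assign), 1.0)
        smoothed = counts + alpha
        return smoothed / smoothed.sum(axis=1, keepdims=True)

    # Initialize topic vectors from randomly chosen sentences.
    topics = emb[rng.choice(len(emb), size=n_topics, replace=False)].copy()
    assign = rng.integers(0, n_topics, size=len(emb))
    alpha = alpha0
    for it in range(n_iter):
        # Annealing (assumed schedule): the prior decays geometrically and
        # is floored at alpha_min, so early iterations are dominated by
        # embedding similarity and later ones by document context.
        alpha = max(alpha_min, alpha0 * decay ** it)
        p_topic_doc = doc_topic(assign, alpha)
        # Hard E-step: each sentence takes its single best-scoring topic.
        score = (emb @ topics.T) / temp + np.log(p_topic_doc[doc_ids])
        assign = score.argmax(axis=1)
        # M-step: each topic vector becomes the normalized mean of its members.
        for k in range(n_topics):
            members = emb[assign == k]
            if len(members) == 0:
                continue  # keep the previous vector for an empty topic
            centroid = members.mean(axis=0)
            norm = np.linalg.norm(centroid)
            if norm > 0:
                topics[k] = centroid / norm
    return assign, topics, doc_topic(assign, alpha)
```

In this sketch, `alpha` plays the role of a symmetric prior on the topic-document distribution. Replacing it with a per-document or per-topic array would be one way to express the kind of customization mentioned above, though how the actual method exposes its priors is not specified here.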