Over the last decade, topic modelling was dominated mostly by Bayesian graphical models. With the rise of transformers in Natural Language Processing, however, several successful models that rely on straightforward clustering approaches in transformer-based embedding spaces have emerged and consolidated the notion of topics as clusters of embedding vectors. We propose the Transformer-Representation Neural Topic Model (TNTM), which combines the benefits of topic representations in transformer-based embedding spaces with those of probabilistic modelling. This approach thus unifies the powerful and versatile notion of topics based on transformer embeddings with fully probabilistic modelling, as in models such as Latent Dirichlet Allocation (LDA). We use the variational autoencoder (VAE) framework for improved inference speed and modelling flexibility. Experimental results show that our proposed model performs on par with various state-of-the-art approaches in terms of embedding coherence while maintaining almost perfect topic diversity. The corresponding source code is available at https://github.com/ArikReuter/TNTM.