This paper introduces a novel approach to topic modeling that uses the latent codebooks of a Vector-Quantized Variational Auto-Encoder (VQ-VAE), which discretely encapsulate the rich information in pre-trained embeddings, such as those produced by a pre-trained language model. Interpreting the latent codebooks and embeddings as a conceptual bag-of-words (BoW), we propose a new generative topic model, Topic-VQ-VAE (TVQ-VAE), which inversely generates the original documents related to the respective latent codebook. TVQ-VAE can visualize topics through various generative distributions, including the traditional BoW distribution and autoregressive image generation. Our experimental results on document analysis and image generation demonstrate that TVQ-VAE effectively captures topic context, revealing the underlying structure of the dataset, and supports flexible forms of document generation. The official implementation of the proposed TVQ-VAE is available at https://github.com/clovaai/TVQ-VAE.
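The idea of treating codebook assignments as a conceptual bag-of-words can be illustrated with a minimal sketch. It is not taken from the official implementation: the function names `quantize` and `bag_of_codes`, the toy 2-D codebook, and the toy embeddings are all assumptions made for illustration, and a trained VQ-VAE would learn its codebook rather than use fixed values. The sketch only shows the generic vector-quantization step (nearest-code lookup) followed by counting how often each code is used in one document.

```python
import numpy as np


def quantize(embeddings, codebook):
    """Return, for each embedding, the index of its nearest codebook vector.

    Hypothetical helper: plain nearest-neighbour lookup under squared
    Euclidean distance, as in generic vector quantization.
    """
    # Pairwise differences via broadcasting: shape (n_embeddings, n_codes, dim)
    diffs = embeddings[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)  # shape (n_embeddings, n_codes)
    return np.argmin(dists, axis=1)


def bag_of_codes(code_indices, num_codes):
    """Count how many times each code index occurs (a bag of codebook entries)."""
    return np.bincount(code_indices, minlength=num_codes)


# Toy data (assumed): 3 codes and 4 token embeddings of one document, all in 2-D.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
doc_embeddings = np.array([[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [-0.8, 0.9]])

indices = quantize(doc_embeddings, codebook)
counts = bag_of_codes(indices, len(codebook))
print(indices.tolist())  # [0, 1, 1, 2]
print(counts.tolist())   # [1, 2, 1]
```

The resulting count vector plays the role that word counts play in a classical topic model, with codebook entries standing in for vocabulary items.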