Topic models are widely used to discover the latent representation of a set of documents. The two canonical models are latent Dirichlet allocation and Gaussian latent Dirichlet allocation: the former represents each latent topic as a multinomial distribution over words, whereas the latter represents it as a multivariate Gaussian distribution over pre-trained word embedding vectors. Compared with latent Dirichlet allocation, Gaussian latent Dirichlet allocation is limited in that it does not capture the polysemy of a word such as "bank." In this paper, we show that Gaussian latent Dirichlet allocation can recover the ability to capture polysemy by introducing a hierarchical structure in the set of topics that the model can use to represent a given document. Our Gaussian hierarchical latent Dirichlet allocation significantly improves polysemy detection compared with other Gaussian-based models and provides more parsimonious topic representations than hierarchical latent Dirichlet allocation. Our extensive quantitative experiments show that our model also achieves better topic coherence and held-out document predictive accuracy across a wide range of corpora and word embedding vectors.