Topic models have been widely used for decades across a variety of applications. However, existing topic models commonly suffer from the notorious topic collapsing problem: the discovered topics semantically collapse towards each other, which leads to highly repetitive topics, insufficient topic discovery, and impaired model interpretability. In this paper, we propose a new neural topic model, the Embedding Clustering Regularization Topic Model (ECRTM). In addition to the usual reconstruction error, we propose a novel Embedding Clustering Regularization (ECR), which forces each topic embedding to be the center of a separately aggregated word embedding cluster in the semantic space. As a result, each produced topic contains distinct word semantics, which alleviates topic collapsing. Regularized by ECR, our ECRTM generates diverse and coherent topics together with high-quality topic distributions of documents. Extensive experiments on benchmark datasets demonstrate that ECRTM effectively addresses the topic collapsing issue and consistently surpasses state-of-the-art baselines in terms of topic quality, topic distributions of documents, and downstream classification tasks.
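To make the idea of a clustering regularizer on embeddings concrete, the following is a minimal sketch and not the formulation used in ECRTM. It assumes squared Euclidean distances between word and topic embeddings and a balanced soft assignment computed with Sinkhorn iterations under uniform marginals; the function names, the `epsilon` value, and the toy data are all hypothetical. The balance constraint is one way to encourage every topic to collect its own share of words, so that topics sitting at the same point are penalized.

```python
import numpy as np

def pairwise_sq_dist(words, topics):
    """Squared Euclidean distances, shape (V, K), between V words and K topics."""
    diff = words[:, None, :] - topics[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def sinkhorn_plan(cost, epsilon=0.5, n_iters=200):
    """Entropy-regularized transport plan with uniform marginals (assumed)."""
    n_words, n_topics = cost.shape
    a = np.full(n_words, 1.0 / n_words)    # each word carries equal mass
    b = np.full(n_topics, 1.0 / n_topics)  # each topic receives equal mass
    kernel = np.exp(-cost / epsilon)
    u = np.ones(n_words)
    v = np.ones(n_topics)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]

def clustering_regularizer(words, topics, epsilon=0.5):
    """Transport cost of softly assigning words to topics (illustrative only)."""
    cost = pairwise_sq_dist(words, topics)
    plan = sinkhorn_plan(cost, epsilon)
    return float(np.sum(plan * cost)), plan

# Toy data: two well-separated groups of 2-D "word embeddings".
rng = np.random.default_rng(0)
words = np.concatenate([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=3.0, scale=0.1, size=(20, 2)),
])

# Topics placed at the two group centers versus topics collapsed onto one point.
topics_separated = np.array([[0.0, 0.0], [3.0, 3.0]])
topics_collapsed = np.array([[1.5, 1.5], [1.5, 1.5]])

loss_separated, plan_separated = clustering_regularizer(words, topics_separated)
loss_collapsed, plan_collapsed = clustering_regularizer(words, topics_collapsed)

print("separated topics:", loss_separated)
print("collapsed topics:", loss_collapsed)
```

On this toy data the collapsed configuration incurs a larger penalty than the separated one, which is the qualitative behavior the abstract attributes to ECR. In an actual neural topic model such a term would be added to the reconstruction loss and minimized with respect to the embeddings by gradient descent, which this NumPy sketch does not do.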