Uncovering hidden topics from short texts is challenging for both traditional and neural models due to data sparsity, which limits word co-occurrence patterns, and label sparsity, which stems from incomplete reconstruction targets. Although data aggregation offers a potential solution, existing neural topic models often overlook it because of its high time complexity, poor aggregation quality, and the difficulty of inferring topic proportions for individual documents. In this paper, we propose a novel model, GloCOM (Global Clustering COntexts for Topic Models), which addresses these challenges by constructing aggregated global clustering contexts for short documents, leveraging text embeddings from pre-trained language models. GloCOM can infer both global topic distributions for clustering contexts and local distributions for individual short texts. Additionally, the model incorporates these global contexts to augment the reconstruction loss, effectively handling the label sparsity issue. Extensive experiments on short text datasets show that our approach outperforms other state-of-the-art models in both topic quality and document representations.
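To make the aggregation step concrete, the following is a minimal sketch of how short documents could be grouped into global clustering contexts by clustering their embeddings. It is an illustration only: the abstract does not specify the clustering algorithm or the embedding model, so plain k-means and random vectors (standing in for pre-trained language model embeddings) are assumptions, and the toy documents are hypothetical.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means; returns a cluster label for each row of X."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each embedding to its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # Recompute centroids (keep the old one if a cluster is empty).
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

# Hypothetical short documents; real usage would embed them with a
# pre-trained language model instead of drawing random vectors.
docs = ["short text a", "short text b", "another snippet",
        "more words", "tiny doc", "brief note"]
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(len(docs), 8))  # stand-in for PLM embeddings

labels = kmeans(embeddings, k=2)
# Aggregate each cluster's documents into one global context, which a
# topic model could then use alongside the individual short texts.
contexts = {j: " ".join(d for d, l in zip(docs, labels) if l == j)
            for j in sorted(set(labels.tolist()))}
```

Each aggregated context pools the word content of an entire cluster, which is what restores the co-occurrence signal that individual short texts lack.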