Traditional neural topic models are typically optimized by reconstructing the document's Bag-of-Words (BoW) representations, overlooking contextual information and struggling with data sparsity. In this work, we propose a novel approach that constructs semantically grounded soft-label targets using Language Models (LMs): we project the next-token probabilities, conditioned on a specialized prompt, onto a pre-defined vocabulary to obtain contextually enriched supervision signals. By training topic models to reconstruct these soft labels from the LM hidden states, our method produces higher-quality topics that align more closely with the underlying thematic structure of the corpus. Experiments on three datasets show that our method achieves substantial improvements in topic coherence and purity over existing baselines. Additionally, we introduce a retrieval-based metric, which shows that our approach significantly outperforms existing methods in identifying semantically similar documents, highlighting its effectiveness for retrieval-oriented applications.
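The projection step can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the dict-based logit interface, and the temperature parameter are hypothetical; it only shows how next-token logits might be restricted to a pre-defined vocabulary and renormalized into a soft-label distribution.

```python
import math

def soft_label_targets(next_token_logits, vocab_ids, temperature=1.0):
    """Project LM next-token logits onto a pre-defined vocabulary.

    next_token_logits: mapping of token id -> logit from the LM head,
        conditioned on a specialized prompt (hypothetical interface).
    vocab_ids: token ids of the pre-defined topic-model vocabulary.
    Returns a probability distribution over vocab_ids (the soft label).
    """
    # Keep only the logits of tokens in the pre-defined vocabulary,
    # then renormalize with a numerically stable softmax.
    selected = [next_token_logits[i] / temperature for i in vocab_ids]
    m = max(selected)
    exps = [math.exp(x - m) for x in selected]
    z = sum(exps)
    return [e / z for e in exps]

# Toy usage: four LM tokens, three of which are in the topic vocabulary.
logits = {0: 2.0, 1: 0.5, 2: -1.0, 3: 1.0}
soft_label = soft_label_targets(logits, vocab_ids=[0, 1, 3])
```

The resulting `soft_label` sums to one and concentrates mass on vocabulary words the LM deems likely in context, which is what makes it a denser supervision signal than a sparse BoW target.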