Topic modeling is a widely used technique for revealing underlying thematic structures within textual data. However, existing models have limitations, particularly on short-text datasets that lack co-occurring words. Moreover, these models often neglect sentence-level semantics and focus primarily on token-level semantics. In this paper, we propose PromptTopic, a novel topic modeling approach that harnesses the advanced language understanding of large language models (LLMs) to address these challenges. PromptTopic extracts topics at the sentence level from individual documents, then aggregates and condenses them into a predefined number of topics, ultimately providing coherent topics for texts of varying lengths. This approach eliminates the need for manual parameter tuning and improves the quality of the extracted topics. We benchmark PromptTopic against state-of-the-art baselines on three highly diverse datasets, demonstrating its proficiency in discovering meaningful topics. Furthermore, qualitative analysis showcases PromptTopic's ability to uncover relevant topics across multiple datasets.
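The two stages described in the abstract (per-document topic extraction, then aggregation and condensation to a predefined number of topics) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `extract_topics` is a hypothetical callable standing in for an LLM prompt, and the merge rule (fold the rarest topic into its most similar peer, using character-level similarity from `difflib`) is a simple stand-in, not necessarily the condensation strategy PromptTopic uses.

```python
from __future__ import annotations

from collections import Counter
from difflib import SequenceMatcher
from typing import Callable, Iterable


def collect_topics(
    docs: Iterable[str],
    extract_topics: Callable[[str], list[str]],  # hypothetical LLM-backed extractor
) -> Counter:
    """Count in how many documents each topic label appears."""
    counts: Counter = Counter()
    for doc in docs:
        # Normalise labels and count each label at most once per document.
        counts.update({t.strip().lower() for t in extract_topics(doc)})
    return counts


def condense_topics(counts: Counter, k: int) -> Counter:
    """Merge the rarest topic into its most similar peer until k topics remain.

    Assumption: string similarity is only a placeholder for a semantic
    similarity measure or an LLM-based merge decision.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    merged = Counter(counts)
    while len(merged) > k:
        # Pick the rarest topic; break ties alphabetically for determinism.
        rare = min(merged, key=lambda t: (merged[t], t))
        rare_count = merged.pop(rare)
        # Fold it into the most similar remaining topic.
        target = max(
            merged,
            key=lambda t: (SequenceMatcher(None, rare, t).ratio(), merged[t], t),
        )
        merged[target] += rare_count
    return merged


def toy_extractor(doc: str) -> list[str]:
    """Keyword lookup used only to make the sketch runnable without an LLM."""
    keywords = {
        "goal": "football",
        "match": "football match",
        "vote": "election",
        "ballot": "elections",
    }
    return [label for word, label in keywords.items() if word in doc.lower()]


docs = [
    "Late goal wins the match",
    "Voters cast a ballot",
    "Vote count continues",
    "Another goal",
]
topics = condense_topics(collect_topics(docs, toy_extractor), k=2)
```

In practice, `toy_extractor` would be replaced by a function that sends each document to an LLM and parses the returned topic labels. The value of `k` corresponds to the predefined number of topics mentioned in the abstract.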