Topic modelling, as a well-established unsupervised technique, has found extensive use in automatically detecting significant topics within a corpus of documents. However, classic topic modelling approaches (e.g., LDA) have certain drawbacks, such as the lack of semantic understanding and the presence of overlapping topics. In this work, we investigate the untapped potential of large language models (LLMs) as an alternative for uncovering the underlying topics within extensive text corpora. To this end, we introduce a framework that prompts LLMs to generate topics from a given set of documents and establish evaluation protocols to assess the clustering efficacy of LLMs. Our findings indicate that LLMs with appropriate prompts can stand out as a viable alternative, capable of generating relevant topic titles and adhering to human guidelines to refine and merge topics. Through in-depth experiments and evaluation, we summarise the advantages and constraints of employing LLMs in topic extraction.