Topic modelling, a well-established unsupervised technique, is widely used to automatically detect significant topics within a corpus of documents. However, classic topic modelling approaches (e.g., LDA) have drawbacks, such as a lack of semantic understanding and the presence of overlapping topics. In this work, we investigate the untapped potential of large language models (LLMs) as an alternative for uncovering the underlying topics within extensive text corpora. To this end, we introduce a framework that prompts LLMs to generate topics from a given set of documents, and we establish evaluation protocols to assess the clustering efficacy of LLMs. Our findings indicate that LLMs with appropriate prompts are a viable alternative, capable of generating relevant topic titles and of adhering to human guidelines to refine and merge topics. Through in-depth experiments and evaluation, we summarise the advantages and constraints of employing LLMs in topic extraction.
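To make the prompting idea concrete, the following is a minimal sketch of how a set of documents might be turned into a topic-generation prompt and how the returned topic titles might be tallied across batches. It is not the framework's actual implementation: the prompt wording, the function names (`build_topic_prompt`, `parse_topics`, `extract_topics`), and the `llm` callable are all hypothetical, and merging is approximated here by simple case-insensitive counting rather than by LLM-guided refinement.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List


def build_topic_prompt(documents: Iterable[str], max_topics: int = 5) -> str:
    """Assemble a plain-text prompt asking a model for topic titles.

    The wording is an illustrative assumption, not a prompt from the paper.
    """
    numbered = "\n".join(
        f"{i}. {doc}" for i, doc in enumerate(documents, start=1)
    )
    return (
        f"List at most {max_topics} topics covered by the documents below.\n"
        "Return one short topic title per line.\n\n"
        f"{numbered}"
    )


def parse_topics(response: str) -> List[str]:
    """Split a model response into cleaned, non-empty topic titles."""
    topics = []
    for line in response.splitlines():
        # Drop a leading bullet ("-", "*") or list number ("1.", "2)").
        title = re.sub(r"^\s*(?:[-*]|\d+[.)])\s*", "", line).strip()
        if title:
            topics.append(title)
    return topics


def extract_topics(
    batches: Iterable[List[str]],
    llm: Callable[[str], str],
    max_topics: int = 5,
) -> Counter:
    """Query the model once per batch and count each topic title.

    `llm` is any function mapping a prompt string to a response string
    (hypothetical; plug in whichever client is actually in use).
    """
    counts: Counter = Counter()
    for batch in batches:
        response = llm(build_topic_prompt(batch, max_topics))
        # Lower-casing is a crude stand-in for merging duplicate topics.
        counts.update(title.lower() for title in parse_topics(response))
    return counts
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client, so a stub function can stand in for it when checking the parsing and counting logic.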