Topic modeling is widely used to uncover thematic structure in text corpora, yet traditional models often struggle with specificity and coherence in domain-focused applications. Guided approaches such as SeededLDA and CorEx incorporate user-provided seed words to improve relevance, but they remain labor-intensive and static. Large language models (LLMs) offer potential for dynamic topic refinement and discovery, yet applying them at scale incurs high API costs. To address these challenges, we propose LITA, an LLM-assisted Iterative Topic Augmentation framework that integrates user-provided seed words with embedding-based clustering and iterative refinement. LITA identifies a small set of ambiguous documents and employs an LLM to reassign them to existing or new topics, minimizing API calls while improving topic quality. Experiments on two datasets, evaluated with topic-quality and clustering metrics, demonstrate that LITA outperforms five baselines: LDA, SeededLDA, CorEx, BERTopic, and PromptTopic. Our work offers an efficient and adaptable framework for advancing topic modeling and text clustering.
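The core cost-saving idea described above, routing only low-confidence documents to the LLM, can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes precomputed document embeddings and topic centroids (e.g., derived from seed words), assigns each document to its nearest centroid by cosine similarity, and flags documents whose top-two similarity margin falls below a hypothetical threshold as candidates for LLM reassignment.

```python
import numpy as np

def assign_and_flag(doc_embs, topic_centroids, margin_threshold=0.1):
    """Assign each document to its nearest topic centroid (cosine similarity)
    and flag documents whose top-2 similarity margin is below the threshold
    as 'ambiguous', i.e., candidates to send to the LLM for reassignment.
    `margin_threshold` is an illustrative hyperparameter, not from the paper."""
    # Normalize rows so plain dot products become cosine similarities.
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    c = topic_centroids / np.linalg.norm(topic_centroids, axis=1, keepdims=True)
    sims = d @ c.T                         # shape: (n_docs, n_topics)
    order = np.argsort(sims, axis=1)       # ascending; last column = best topic
    best, second = order[:, -1], order[:, -2]
    rows = np.arange(len(d))
    margins = sims[rows, best] - sims[rows, second]
    ambiguous = margins < margin_threshold
    return best, ambiguous

# Toy example: 2 topic centroids, 3 documents in a 2-D embedding space.
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
docs = np.array([[0.9, 0.1],    # clearly topic 0
                 [0.1, 0.9],    # clearly topic 1
                 [0.7, 0.72]])  # near the boundary -> route to the LLM
labels, ambiguous = assign_and_flag(docs, centroids, margin_threshold=0.1)
```

Only the documents marked `ambiguous` would incur an API call; confident assignments are kept as-is, which is what keeps per-iteration cost low.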