Large language models (LLMs) are increasingly used for topic modeling, outperforming classical topic models such as LDA. Commonly, pre-trained LLM encoders such as BERT are used out-of-the-box, even though fine-tuning is known to improve LLMs considerably. The challenge lies in obtaining a suitable labeled dataset for fine-tuning. In this paper, we build on the recent idea of using bags of sentences as the elementary unit for computing topics. Based on this idea, we derive an approach called FT-Topic for unsupervised fine-tuning, which constructs a training dataset automatically in two steps. First, a heuristic method identifies pairs of sentence groups that are assumed to belong either to the same topic or to different topics. Second, we remove sentence pairs that are likely labeled incorrectly. The resulting dataset is used to fine-tune an encoder LLM, which can be leveraged by any topic modeling approach that relies on embeddings. We demonstrate its effectiveness by deriving a novel state-of-the-art topic modeling method called SenClu. The method achieves fast inference through an expectation-maximization algorithm with hard assignments of sentence groups to a single topic, while allowing users to encode prior knowledge about the topic-document distribution. Code is available at https://github.com/JohnTailor/FT-Topic
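The hard-assignment EM idea behind SenClu's fast inference can be illustrated with a minimal sketch: the E-step assigns each sentence-group embedding to its single most similar topic vector, and the M-step recomputes each topic vector as the mean of its assigned members. This reduces to spherical k-means on normalized embeddings; all function names, the farthest-point initialization, and other details below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def hard_em_topics(embeddings, num_topics, num_iters=20, seed=0):
    """Toy hard-assignment EM over sentence-group embeddings.

    E-step: hard-assign each embedding to its single closest topic vector.
    M-step: recompute each topic vector as the mean of its members.
    Equivalent to spherical k-means on L2-normalized embeddings.
    """
    rng = np.random.default_rng(seed)
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    # Farthest-point initialization: start from a random embedding, then
    # repeatedly add the embedding least similar to all chosen centers.
    idx = [int(rng.integers(len(X)))]
    while len(idx) < num_topics:
        sims = X @ X[idx].T                      # (n, chosen) cosine sims
        idx.append(int(sims.max(axis=1).argmin()))
    topics = X[idx].copy()

    for _ in range(num_iters):
        # E-step: cosine similarity -> hard assignment to one topic.
        assign = (X @ topics.T).argmax(axis=1)
        # M-step: each topic vector becomes the normalized mean of its members.
        for k in range(num_topics):
            members = X[assign == k]
            if len(members) > 0:                 # keep old vector if cluster empty
                mean = members.mean(axis=0)
                topics[k] = mean / np.linalg.norm(mean)
    return assign, topics
```

Because every group is committed to exactly one topic, each E-step is a single matrix product followed by an argmax, which is what makes inference fast compared with soft-assignment EM.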