Large language models (LLMs) are increasingly used for topic modeling, outperforming classical topic models such as LDA. Commonly, pre-trained LLM encoders such as BERT are used out of the box, even though fine-tuning is known to improve LLMs considerably. The challenge lies in obtaining a suitable (labeled) dataset for fine-tuning. In this paper, we build on the recent idea of using bags of sentences as the elementary unit for computing topics. From this, we derive FT-Topic, an approach for unsupervised fine-tuning that constructs a training dataset automatically in two steps. First, a heuristic identifies pairs of sentence groups that are assumed to belong either to the same or to different topics. Second, we remove sentence pairs that are likely labeled incorrectly. The resulting dataset is used to fine-tune an encoder LLM, which can be leveraged by any topic modeling approach that uses embeddings. We demonstrate its effectiveness by deriving a novel state-of-the-art topic modeling method, SenClu, which achieves fast inference through an expectation-maximization algorithm with hard assignments of sentence groups to a single topic, while giving users the possibility to encode prior knowledge on the topic-document distribution. Code is at \url{https://github.com/JohnTailor/FT-Topic}
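The two-step dataset construction could be sketched as follows. This is a minimal illustrative version, not the actual FT-Topic heuristic: it assumes adjacent sentence groups within a document share a topic, pairs groups across documents as negatives, and filters pairs whose embedding similarity contradicts the assigned label. The function name `build_pairs`, the `embed` callback, and the `sim_margin` threshold are all hypothetical.

```python
import numpy as np

def build_pairs(doc_groups, embed, sim_margin=0.1):
    """Construct a weakly labeled pair dataset for fine-tuning (sketch).

    doc_groups: list of documents, each a list of sentence-group strings.
    embed: function mapping a string to a unit-norm vector (stand-in for a
           pre-trained encoder such as BERT).
    """
    pairs = []
    # Step 1: heuristic labeling.
    for groups in doc_groups:
        for a, b in zip(groups, groups[1:]):
            pairs.append((a, b, 1))          # adjacent groups: assumed same topic
    for i, gi in enumerate(doc_groups):
        for gj in doc_groups[i + 1:]:
            pairs.append((gi[0], gj[0], 0))  # cross-document: assumed different topics
    # Step 2: drop pairs whose cosine similarity contradicts the label,
    # i.e. pairs that are likely labeled incorrectly.
    sims = [float(embed(a) @ embed(b)) for a, b, _ in pairs]
    mean_pos = np.mean([s for s, (_, _, y) in zip(sims, pairs) if y == 1])
    mean_neg = np.mean([s for s, (_, _, y) in zip(sims, pairs) if y == 0])
    kept = [(a, b, y) for (a, b, y), s in zip(pairs, sims)
            if (y == 1 and s > mean_neg + sim_margin)
            or (y == 0 and s < mean_pos - sim_margin)]
    return kept
```

The filtered pairs can then be fed to any pairwise (e.g. contrastive) fine-tuning objective for the encoder.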
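The hard-assignment EM inference could be illustrated with the following sketch over precomputed unit-norm sentence-group embeddings. This is a simplified stand-in, not the actual SenClu implementation: the function name `senclu_em` and all parameters are illustrative, and the document-level prior on the topic-document distribution is omitted for brevity.

```python
import numpy as np

def senclu_em(X, k, n_iter=20, seed=0):
    """Hard-assignment EM over unit-norm embeddings X of shape (n, d) (sketch).

    E-step: assign each sentence group to the single most similar topic
            vector (hard assignment); a document-level prior over topics
            could bias this score but is omitted here.
    M-step: re-estimate each topic vector as the normalized mean of the
            groups assigned to it.
    """
    rng = np.random.default_rng(seed)
    topics = X[rng.choice(len(X), k, replace=False)]  # initialize from data points
    for _ in range(n_iter):
        z = (X @ topics.T).argmax(axis=1)             # E-step: hard assignment
        for j in range(k):                            # M-step: update topic vectors
            members = X[z == j]
            if len(members):
                c = members.mean(axis=0)
                topics[j] = c / np.linalg.norm(c)
    return z, topics
```

Because each E-step is a single matrix product followed by an argmax, inference stays fast even for large collections, which matches the abstract's claim about hard assignments enabling speed.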