Text clustering remains valuable in real-world applications where manual labeling is cost-prohibitive. It organizes information for efficient analysis by grouping similar texts based on their representations. However, this approach requires embedders fine-tuned on the downstream data and sophisticated similarity metrics. To remove this requirement, this study presents a novel text clustering framework that leverages the in-context learning capacity of Large Language Models (LLMs). Instead of fine-tuning embedders, we propose to transform text clustering into a classification task using an LLM. First, we prompt the LLM to generate potential labels for a given dataset. Second, after merging similar labels generated by the LLM, we prompt the LLM to assign the most appropriate label to each sample in the dataset. Experiments show that our framework achieves comparable or superior performance to state-of-the-art embedding-based clustering methods, without requiring complex fine-tuning or clustering algorithms. Our code is publicly available at https://anonymous.4open.science/r/Text-Clustering-via-LLM-E500.
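The two-stage procedure described in the abstract (label generation, label merging, then per-sample label assignment) can be illustrated with a minimal sketch. This is not the authors' released implementation: the function names, the prompt wording, the batch size, and the `llm` callable (any function that maps a prompt string to a completion string) are all assumptions made for illustration.

```python
from typing import Callable, List, Optional

# Hypothetical interface: any function that takes a prompt and returns text.
LLM = Callable[[str], str]


def _parse_lines(completion: str) -> List[str]:
    """Split a completion into normalized, de-duplicated label strings."""
    labels: List[str] = []
    for line in completion.splitlines():
        label = line.strip(" -\t").lower()
        if label and label not in labels:
            labels.append(label)
    return labels


def generate_labels(llm: LLM, texts: List[str], batch_size: int = 10) -> List[str]:
    """Stage 1a: ask the LLM to propose potential labels, batch by batch."""
    labels: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        prompt = (
            "Propose a short topic label for each text, one per line:\n"
            + "\n".join(f"- {t}" for t in batch)
        )
        for label in _parse_lines(llm(prompt)):
            if label not in labels:
                labels.append(label)
    return labels


def merge_labels(llm: LLM, labels: List[str]) -> List[str]:
    """Stage 1b: ask the LLM to merge labels that mean the same thing."""
    prompt = (
        "Merge labels that mean the same thing. "
        "Return the final labels, one per line:\n" + "\n".join(labels)
    )
    merged = _parse_lines(llm(prompt))
    # Assumption: fall back to the unmerged labels if the reply is empty.
    return merged or labels


def assign_labels(llm: LLM, texts: List[str], labels: List[str]) -> List[Optional[str]]:
    """Stage 2: classify every text into exactly one of the merged labels."""
    assignments: List[Optional[str]] = []
    for text in texts:
        prompt = (
            f"Candidate labels: {', '.join(labels)}\n"
            f"Text: {text}\n"
            "Answer with exactly one label from the list."
        )
        answer = llm(prompt).strip().lower()
        # Assumption: answers outside the label set are recorded as None.
        assignments.append(answer if answer in labels else None)
    return assignments


def cluster_texts(llm: LLM, texts: List[str]) -> List[Optional[str]]:
    """Run both stages; texts that share a label form one cluster."""
    labels = merge_labels(llm, generate_labels(llm, texts))
    return assign_labels(llm, texts, labels)
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client; in practice `llm` would wrap whichever API is in use, and the prompts would likely need few-shot demonstrations to exploit in-context learning as the abstract describes.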