Unlike traditional unsupervised clustering, semi-supervised clustering allows users to provide meaningful structure to the data, which helps the clustering algorithm match the user's intent. Existing approaches to semi-supervised clustering require a significant amount of feedback from an expert to improve the clusters. In this paper, we ask whether a large language model (LLM) can amplify an expert's guidance to enable query-efficient, few-shot semi-supervised text clustering. We show that LLMs are surprisingly effective at improving clustering. We explore three stages where LLMs can be incorporated into clustering: before clustering (improving input features), during clustering (providing constraints to the clusterer), and after clustering (using LLMs for post-correction). We find that incorporating LLMs in the first two stages routinely provides significant improvements in cluster quality, and that LLMs enable a user to make trade-offs between cost and accuracy to produce the desired clusters. We release our code and LLM prompts for the public to use.
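To make the "during clustering" stage concrete, the following is a minimal sketch of how pairwise constraints might be gathered under a fixed query budget and then consolidated. It is not the authors' released implementation: the function names `collect_pairwise_constraints` and `must_link_groups`, the `budget` parameter, and the keyword-based `toy_oracle` are all assumptions made for illustration. In a real system the oracle would be a prompt to an LLM asking whether two texts belong in the same cluster.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[int, int]


def collect_pairwise_constraints(
    texts: Sequence[str],
    same_cluster: Callable[[str, str], bool],
    budget: int,
) -> Tuple[List[Pair], List[Pair]]:
    """Ask the oracle about at most `budget` pairs.

    Returns (must_link, cannot_link) lists of index pairs.
    """
    must_link: List[Pair] = []
    cannot_link: List[Pair] = []
    candidate_pairs = list(combinations(range(len(texts)), 2))[:budget]
    for i, j in candidate_pairs:
        if same_cluster(texts[i], texts[j]):
            must_link.append((i, j))
        else:
            cannot_link.append((i, j))
    return must_link, cannot_link


def must_link_groups(n: int, must_link: Sequence[Pair]) -> List[List[int]]:
    """Merge must-link pairs transitively with a union-find structure."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in must_link:
        parent[find(i)] = find(j)

    groups: dict = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def toy_oracle(a: str, b: str) -> bool:
    """Hypothetical stand-in for an LLM query: compares the first word only."""
    return a.split()[0] == b.split()[0]


texts = [
    "bank transfer failed",
    "bank card lost",
    "flight delayed again",
    "flight refund request",
]
must_link, cannot_link = collect_pairwise_constraints(texts, toy_oracle, budget=6)
groups = must_link_groups(len(texts), must_link)
```

The resulting must-link groups and cannot-link pairs could then be passed to any constrained clusterer. The `budget` argument is where the cost and accuracy trade-off mentioned above would surface: fewer queries cost less but yield fewer constraints.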