We introduce k-LLMmeans, a novel modification of the k-means clustering algorithm that utilizes LLMs to generate textual summaries as cluster centroids, thereby capturing contextual and semantic nuances often lost when relying on purely numerical means of document embeddings. This modification preserves the properties of k-means while offering greater interpretability: each cluster centroid is represented by an LLM-generated summary, whose embedding guides cluster assignments. We also propose a mini-batch variant, enabling efficient online clustering for streaming text data and providing real-time interpretability of evolving cluster centroids. Through extensive simulations, we show that our methods outperform vanilla k-means on multiple metrics while incurring only modest LLM usage that does not scale with dataset size. Finally, we present a case study showcasing the interpretability of evolving cluster centroids in sequential text streams. As part of our evaluation, we compile a new dataset from StackExchange, offering a benchmark for text-stream clustering.
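The loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `embed` here is a toy bag-of-characters embedding and `llm_summarize` a trivial placeholder; in the actual method these would be a sentence-embedding model and an LLM prompted to summarize a cluster's documents.

```python
import math
from collections import defaultdict

def embed(text):
    # Toy stand-in for a real document embedder: a normalized
    # bag-of-characters vector, for illustration only.
    vec = [0.0] * 26
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            vec[ord(ch) - ord('a')] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def llm_summarize(docs):
    # Placeholder: the real method would prompt an LLM to produce
    # a short textual summary describing the cluster's documents.
    return " ".join(docs)

def k_llmmeans(docs, k, iters=5):
    # Initialize centroid summaries with the first k documents.
    summaries = docs[:k]
    clusters = {}
    for _ in range(iters):
        centroids = [embed(s) for s in summaries]
        clusters = defaultdict(list)
        for doc in docs:
            e = embed(doc)
            # Assignment step: nearest centroid by cosine similarity
            # (vectors are unit-normalized, so the dot product suffices).
            j = max(range(k),
                    key=lambda i: sum(a * b for a, b in zip(e, centroids[i])))
            clusters[j].append(doc)
        # Update step: an LLM-generated summary replaces the
        # numerical mean as the cluster centroid.
        summaries = [llm_summarize(clusters[i]) if clusters[i] else summaries[i]
                     for i in range(k)]
    return summaries, clusters
```

The key departure from vanilla k-means is the update step: instead of averaging embedding vectors, each centroid is a human-readable summary whose embedding drives the next round of assignments, which is what makes the evolving centroids directly interpretable.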