We introduce k-LLMmeans, a novel modification of the k-means algorithm for text clustering that leverages LLM-generated summaries as cluster centroids, capturing semantic nuances often missed by purely numerical averages. This design preserves the core optimization properties of k-means while enhancing semantic interpretability and avoiding the scalability and instability issues typical of modern LLM-based clustering. Unlike existing methods, our approach does not increase LLM usage with dataset size and produces transparent intermediate outputs. We further extend it with a mini-batch variant for efficient, real-time clustering of streaming text. Extensive experiments across multiple datasets, embeddings, and LLMs show that k-LLMmeans consistently outperforms k-means and other traditional baselines and achieves results comparable to state-of-the-art LLM-based clustering, with a fraction of the LLM calls. Finally, we present a case study on sequential text streams and introduce a new benchmark dataset constructed from StackExchange to evaluate text-stream clustering methods.
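The mechanism described above — a k-means loop in which each centroid is refreshed from an LLM-generated summary of the cluster's texts — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `summarize_and_embed` is a hypothetical stand-in for the real LLM-summarize-then-embed step, and here it simply averages the member embeddings (which reduces the sketch to plain k-means).

```python
import random


def summarize_and_embed(texts, embeddings):
    # Stand-in for the LLM step: a real implementation would prompt an
    # LLM to summarize `texts`, then embed that summary. Here we just
    # average the member embeddings as a placeholder.
    dim = len(embeddings[0])
    n = len(embeddings)
    return [sum(e[d] for e in embeddings) / n for d in range(dim)]


def k_llmmeans(texts, embeddings, k, iters=10, seed=0):
    """Hypothetical sketch: k-means over text embeddings where the
    update step replaces each centroid with the embedding of a
    cluster summary instead of a purely numerical mean."""
    rng = random.Random(seed)
    centroids = [embeddings[i][:] for i in rng.sample(range(len(texts)), k)]
    assign = [0] * len(texts)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, e in enumerate(embeddings):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(e, centroids[c])),
            )
        # Update step: refresh each non-empty centroid via the summary.
        for c in range(k):
            members = [i for i in range(len(texts)) if assign[i] == c]
            if members:
                centroids[c] = summarize_and_embed(
                    [texts[i] for i in members],
                    [embeddings[i] for i in members],
                )
    return assign, centroids
```

Because only one summary is produced per cluster per iteration, the number of LLM calls depends on `k` and `iters`, not on the dataset size — consistent with the scalability claim above.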