We introduce k-NLPmeans and k-LLMmeans, text-clustering variants of k-means that periodically replace numeric centroids with textual summaries. The key idea, summary-as-centroid, retains k-means assignments in embedding space while producing human-readable, auditable cluster prototypes. The method is LLM-optional: k-NLPmeans uses lightweight, deterministic summarizers, enabling offline, low-cost, and stable operation; k-LLMmeans is a drop-in upgrade that uses an LLM for summaries under a fixed per-iteration budget whose cost does not grow with dataset size. We also present a mini-batch extension for real-time clustering of streaming text. Across diverse datasets, embedding models, and summarization strategies, our approach consistently outperforms classical baselines and approaches the accuracy of recent LLM-based clustering, without extensive LLM calls. Finally, we provide a case study on sequential text streams and release a StackExchange-derived benchmark for evaluating streaming text clustering.
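The summary-as-centroid loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the bag-of-words embedder is a toy stand-in for a real embedding model, and the medoid "summarizer" (the member text closest to the cluster mean) is a hypothetical stand-in for the paper's deterministic summarizers or LLM calls.

```python
import numpy as np

def build_embedder(corpus):
    """Toy normalized bag-of-words embedder built from the corpus
    (stand-in for a real sentence-embedding model)."""
    vocab = {w: i for i, w in enumerate(
        sorted({w for s in corpus for w in s.lower().split()}))}
    def embed(text):
        v = np.zeros(len(vocab))
        for w in text.lower().split():
            v[vocab[w]] += 1.0
        n = np.linalg.norm(v)
        return v / n if n else v
    return embed

def k_nlpmeans(texts, k, iters=6, summary_every=2):
    embed = build_embedder(texts)
    E = np.array([embed(t) for t in texts])

    def summarize(members):
        # Deterministic "summary": the medoid text nearest the cluster mean.
        M = np.array([embed(t) for t in members])
        return members[int(np.argmax(M @ M.mean(axis=0)))]

    centroids = E[:k].copy()  # simple deterministic initialization
    for it in range(1, iters + 1):
        # Standard k-means assignment in embedding space (cosine similarity).
        labels = np.argmax(E @ centroids.T, axis=1)
        for c in range(k):
            members = [t for t, l in zip(texts, labels) if l == c]
            if not members:
                continue
            if it % summary_every == 0:
                # Summary-as-centroid: re-center on the summary's embedding.
                centroids[c] = embed(summarize(members))
            else:
                m = E[labels == c].mean(axis=0)
                centroids[c] = m / np.linalg.norm(m)
    summaries = [summarize([t for t, l in zip(texts, labels) if l == c] or [""])
                 for c in range(k)]
    return labels, summaries
```

Each cluster's prototype is a readable text rather than an opaque vector, while assignments still happen in embedding space; swapping `summarize` for an LLM call under a per-iteration budget gives the k-LLMmeans variant.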