We introduce k-NLPmeans and k-LLMmeans, text-clustering variants of k-means that periodically replace numeric centroids with textual summaries. The key idea, summary-as-centroid, retains k-means assignments in embedding space while producing human-readable, auditable cluster prototypes. The method is LLM-optional: k-NLPmeans uses lightweight, deterministic summarizers, enabling offline, low-cost, and stable operation; k-LLMmeans is a drop-in upgrade that uses an LLM for summaries under a fixed per-iteration budget whose cost does not grow with dataset size. We also present a mini-batch extension for real-time clustering of streaming text. Across diverse datasets, embedding models, and summarization strategies, our approach consistently outperforms classical baselines and approaches the accuracy of recent LLM-based clustering without extensive LLM calls. Finally, we provide a case study on sequential text streams and release a StackExchange-derived benchmark for evaluating streaming text clustering.
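The summary-as-centroid loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `embed` and `summarize` are assumed user-supplied callables, and the trivial summarizer used in the example (returning one representative document) merely stands in for the deterministic summarizers of k-NLPmeans or an LLM call in k-LLMmeans.

```python
import numpy as np

def k_nlpmeans(texts, embed, k, iters=10, summarize=None, refresh=2, seed=0):
    """Sketch of k-means with centroids periodically replaced by
    embedded textual summaries (summary-as-centroid).

    embed:     callable mapping list[str] -> (n, d) float array
    summarize: callable mapping list[str] -> str (cluster summary)
    refresh:   replace numeric centroids with summary embeddings
               every `refresh` iterations
    """
    X = embed(texts)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(texts), size=k, replace=False)]
    labels = np.zeros(len(texts), dtype=int)
    for t in range(iters):
        # standard k-means assignment in embedding space
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        # numeric centroid update, as in vanilla k-means
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(0)
        # periodically swap each centroid for the embedding of a
        # textual summary of its cluster (the key idea of the method)
        if summarize is not None and (t + 1) % refresh == 0:
            for j in range(k):
                idx = np.where(labels == j)[0]
                if len(idx):
                    summary = summarize([texts[i] for i in idx])
                    centroids[j] = embed([summary])[0]
    return labels, centroids
```

Because the summary is re-embedded into the same space, assignments remain ordinary nearest-centroid steps, while the summaries themselves serve as readable cluster prototypes.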