We present {\em generative clustering} (GC), which clusters a set of documents $\mathrm{X}$ via texts $\mathrm{Y}$ generated by large language models (LLMs), rather than clustering the original documents $\mathrm{X}$ directly. Because LLMs provide probability distributions, the similarity between two documents can be rigorously defined in an information-theoretic manner by the KL divergence. We also propose a natural, novel clustering algorithm based on importance sampling. We show that GC achieves state-of-the-art performance, outperforming previous clustering methods, often by a large margin. Furthermore, we show an application to generative document retrieval, in which documents are indexed via hierarchical clustering; our method improves the retrieval accuracy.