We propose In-Context Clustering (ICC), a flexible LLM-based procedure for clustering data from diverse distributions. Unlike traditional clustering algorithms constrained by predefined similarity measures, ICC captures complex relationships among inputs through an attention mechanism. We show that pretrained LLMs exhibit impressive zero-shot clustering capabilities on text-encoded numeric data, with attention matrices revealing salient cluster patterns. Spectral clustering on these attention matrices yields surprisingly competitive performance. We further enhance the clustering capabilities of LLMs on numeric and image data by fine-tuning with the Next Token Prediction (NTP) loss. Moreover, LLM prompting enables text-conditioned image clustering, a capability that classical clustering methods lack. Our work extends in-context learning to the unsupervised setting, showcasing the effectiveness and flexibility of LLMs for clustering. Our code is available at https://agenticlearning.ai/icc.
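To make the attention-based clustering step concrete, the sketch below performs spectral clustering on an LLM's attention matrix over text-encoded numeric points. This is not the released ICC code: the model choice (`gpt2`), the prompt format, and the layer/head-averaged pooling of token-level attention into a point-level affinity matrix are all illustrative assumptions.

```python
# Minimal sketch of attention-based spectral clustering, assuming a
# HuggingFace causal LM and illustrative prompt/pooling choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.cluster import SpectralClustering

def attention_spectral_cluster(points, n_clusters, model_name="gpt2"):
    """Cluster text-encoded numeric points via spectral clustering on an
    LLM attention matrix averaged over all layers and heads."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    # Text-encode each point and record its character span in the prompt.
    texts = [" ".join(f"{x:.3f}" for x in p) for p in points]
    prompt, spans = "", []
    for t in texts:
        start = len(prompt)
        prompt += t + "; "
        spans.append((start, start + len(t)))

    enc = tok(prompt, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()

    with torch.no_grad():
        out = model(**enc, output_attentions=True)
    # out.attentions: one (batch, heads, seq, seq) tensor per layer.
    # Average over layers and heads to get a single (seq, seq) matrix.
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]

    # Map each point to the token indices overlapping its character span.
    groups = [
        [i for i, (s, e) in enumerate(offsets) if s < pe and e > ps]
        for ps, pe in spans
    ]
    # Pool token-level attention into a point-by-point affinity matrix.
    n = len(points)
    A = torch.zeros(n, n)
    for i, gi in enumerate(groups):
        for j, gj in enumerate(groups):
            A[i, j] = attn[gi][:, gj].mean()
    A = ((A + A.T) / 2).numpy()  # symmetrize for spectral clustering

    return SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed"
    ).fit_predict(A)
```

As a quick sanity check, `attention_spectral_cluster([[0.1, 0.2], [0.15, 0.22], [5.0, 5.1]], n_clusters=2)` should separate the two nearby points from the distant one, though zero-shot quality will vary with the model, prompt format, and which layers or heads are pooled.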