Classical clustering methods do not give users direct control over the clustering results, and those results may not match the criterion a user has in mind. In this work, we present a new methodology for performing image clustering based on user-specified text criteria by leveraging modern vision-language models and large language models. We call our method Image Clustering Conditioned on Text Criteria (IC|TC), and it represents a different paradigm of image clustering. IC|TC requires a minimal and practical degree of human intervention and, in return, grants the user significant control over the clustering results. Our experiments show that IC|TC can effectively cluster images according to various criteria, such as human action, physical location, or a person's mood, while significantly outperforming baselines.