Classical clustering methods give users no direct control over the clustering results, and those results may not be consistent with the criterion a user has in mind. In this work, we present a new methodology for performing image clustering based on user-specified text criteria by leveraging modern vision-language models and large language models. We call our method Image Clustering Conditioned on Text Criteria (IC$|$TC), and it represents a different paradigm of image clustering. IC$|$TC requires a minimal and practical degree of human intervention and, in return, grants the user significant control over the clustering results. Our experiments show that IC$|$TC can effectively cluster images by various criteria, such as human action, physical location, or a person's mood, while significantly outperforming baselines.