Classical clustering methods do not give users direct control over the clustering results, and the results may not be consistent with the criterion that a user has in mind. In this work, we present a new methodology for performing image clustering based on user-specified text criteria by leveraging modern vision-language models and large language models. We call our method Image Clustering Conditioned on Text Criteria (IC|TC), and it represents a different paradigm of image clustering. IC|TC requires a minimal and practical degree of human intervention and grants the user significant control over the clustering results in return. Our experiments show that IC|TC can effectively cluster images with various criteria, such as human action, physical location, or a person's mood, while significantly outperforming baselines.