Image clustering divides a collection of images into meaningful groups, which are typically interpreted post hoc via human-given annotations. These annotations are usually text, which raises the question of whether text can serve as an abstraction for image clustering. Current image clustering methods, however, neglect generated textual descriptions. We therefore propose Text-Guided Image Clustering, i.e., generating text using image captioning and visual question-answering (VQA) models and subsequently clustering the generated text. Further, we introduce a novel approach to inject task or domain knowledge for clustering by prompting VQA models. Across eight diverse image clustering datasets, our results show that the obtained text representations often outperform image features. Additionally, we propose a counting-based cluster explainability method. Our evaluations show that the derived keyword-based explanations describe clusters better than the respective cluster accuracy suggests. Overall, this research challenges traditional approaches and paves the way for a paradigm shift in image clustering based on generated text.
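To make the idea of a counting-based, keyword-based cluster explanation concrete, the following is a minimal sketch and not the method as implemented in the paper. It assumes that each image has already been turned into a short text (the captions below are hand-written placeholders standing in for captioning or VQA output) and that cluster labels are available from any text clustering step. The function name `cluster_keywords`, the stopword list, and the `top_k` parameter are all hypothetical choices made for illustration.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real setup would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "on", "in", "is", "and", "with"}


def cluster_keywords(texts, labels, top_k=3):
    """Return the top_k most frequent non-stopword tokens for each cluster."""
    counts = {}
    for text, label in zip(texts, labels):
        # Lowercase and split on whitespace; drop stopwords before counting.
        tokens = [t for t in text.lower().split() if t not in STOPWORDS]
        counts.setdefault(label, Counter()).update(tokens)
    return {
        label: [word for word, _ in counter.most_common(top_k)]
        for label, counter in counts.items()
    }


# Placeholder captions standing in for generated image descriptions.
captions = [
    "a dog running on the grass",
    "a brown dog in the park",
    "a red car on the road",
    "a car parked on the street",
]
# Assumed cluster assignments, e.g. from clustering the caption embeddings.
labels = [0, 0, 1, 1]

keywords = cluster_keywords(captions, labels, top_k=1)
print(keywords)
```

In this toy example the most frequent keyword of each cluster names its dominant object, which illustrates how keyword counts can describe a cluster directly, without reference to ground-truth class labels.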