Image clustering divides a collection of images into meaningful groups, which are typically interpreted post hoc via human-given annotations. These annotations are usually text, which raises the question of whether text can serve as an abstraction for image clustering. Current image clustering methods, however, neglect generated textual descriptions. We therefore propose Text-Guided Image Clustering: generating text with image captioning and visual question-answering (VQA) models and then clustering the generated text. Further, we introduce a novel approach for injecting task or domain knowledge into clustering by prompting VQA models. Across eight diverse image clustering datasets, our results show that the obtained text representations often outperform image features. Additionally, we propose a counting-based cluster explainability method. Our evaluations show that the derived keyword-based explanations describe clusters better than the respective cluster accuracy suggests. Overall, this research challenges traditional approaches and paves the way for a paradigm shift in image clustering based on generated text.
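To make the idea of a counting-based, keyword-based cluster explanation concrete, the following is a minimal sketch and not the exact method evaluated in this work. It assumes captions have already been generated and clustered. The function names `tokenize` and `cluster_keywords`, the small stop-word list, the example captions, and the rule of counting each word once per caption are all illustrative assumptions.

```python
from collections import Counter
import re

# Hypothetical stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "is", "and", "with", "at"}


def tokenize(text):
    """Lowercase a caption and return its alphabetic, non-stop-word tokens."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def cluster_keywords(captions, labels, top_k=3):
    """Return the top_k most frequent words for each cluster label.

    Assumption: each word is counted at most once per caption, so a count is
    the number of images in the cluster whose generated text mentions it.
    """
    counts = {}
    for caption, label in zip(captions, labels):
        counts.setdefault(label, Counter()).update(set(tokenize(caption)))

    keywords = {}
    for label, counter in counts.items():
        # Sort by descending count, then alphabetically, so ties are deterministic.
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        keywords[label] = [word for word, _ in ranked[:top_k]]
    return keywords


# Made-up captions standing in for captioning-model output, with made-up
# cluster assignments standing in for the result of clustering the text.
captions = [
    "a dog running on the grass",
    "a brown dog playing with a ball",
    "a dog sitting on grass",
    "a red car parked on the street",
    "a car driving on a street at night",
]
labels = [0, 0, 0, 1, 1]

print(cluster_keywords(captions, labels, top_k=2))
# prints {0: ['dog', 'grass'], 1: ['car', 'street']}
```

Counting per caption rather than per token keeps a single verbose caption from dominating a cluster's keyword list. Other weighting schemes would fit the same interface.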