Vocabulary-free Image Classification

Recent advances in large vision-language models have revolutionized the image classification paradigm. Although these models show impressive zero-shot capabilities, they assume a pre-defined set of categories, a.k.a. the vocabulary, at test time for composing the textual prompts. However, such an assumption can be impractical when the semantic context is unknown and evolving. We thus formalize a novel task, termed Vocabulary-free Image Classification (VIC), where we aim to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary. VIC is a challenging task because the semantic space is extremely large, containing millions of concepts, with hard-to-discriminate fine-grained categories. In this work, we first empirically verify that representing this semantic space by means of an external vision-language database is the most effective way to obtain semantically relevant content for classifying the image. We then propose Category Search from External Databases (CaSED), a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner. CaSED first extracts a set of candidate categories from captions retrieved from the database based on their semantic similarity to the image, and then assigns to the image the best-matching candidate category according to the same vision-language model. Experiments on benchmark datasets validate that CaSED outperforms other, more complex vision-language frameworks while being efficient, with far fewer parameters, paving the way for future research in this direction.
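The two-stage procedure described above (retrieve captions, extract candidate categories, then score the candidates against the image) can be illustrated with a minimal sketch. This is not the authors' implementation: the 3-dimensional embeddings, the `TOY_TEXT_EMB` lookup table, the stopword list, and all function names are hypothetical stand-ins for a real vision-language encoder and caption database, and candidate extraction is reduced to simple word splitting.

```python
import numpy as np

# Hypothetical toy text embeddings standing in for a real text encoder.
TOY_TEXT_EMB = {
    "photo": [0.5, 0.5, 0.5], "corgi": [1.0, 0.0, 0.0],
    "grass": [0.2, 1.0, 0.0], "red": [0.1, 0.1, 0.6],
    "sports": [0.0, 0.3, 0.7], "car": [0.0, 0.0, 1.0],
}
STOPWORDS = {"a", "of", "on"}  # assumed filter, for illustration only


def normalize(x):
    """Scale vectors to unit length so dot products are cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def retrieve_captions(image_emb, caption_embs, captions, k=2):
    """Return the k captions whose embeddings are most similar to the image."""
    sims = normalize(caption_embs) @ normalize(image_emb)
    top = np.argsort(-sims)[:k]
    return [captions[i] for i in top]


def extract_candidates(captions, stopwords=STOPWORDS):
    """Collect unique non-stopword tokens from the captions, in order."""
    candidates = []
    for caption in captions:
        for word in caption.lower().split():
            if word not in stopwords and word not in candidates:
                candidates.append(word)
    return candidates


def classify(image_emb, candidates, embed_text):
    """Pick the candidate whose text embedding best matches the image."""
    cand_embs = normalize([embed_text(c) for c in candidates])
    scores = cand_embs @ normalize(image_emb)
    return candidates[int(np.argmax(scores))]


# Toy database of captions with made-up embeddings.
captions = ["a photo of a corgi", "a corgi on grass", "a red sports car"]
caption_embs = [[0.9, 0.2, 0.1], [0.7, 0.6, 0.0], [0.0, 0.1, 1.0]]
image_emb = [0.9, 0.1, 0.0]  # made-up embedding of a query image

retrieved = retrieve_captions(image_emb, caption_embs, captions, k=2)
candidates = extract_candidates(retrieved)
predicted = classify(image_emb, candidates, lambda w: TOY_TEXT_EMB[w])
print(predicted)  # prints "corgi"
```

No vocabulary is supplied by the user in this sketch: the candidate list comes entirely from the retrieved captions, and the same (toy) embedding space is used for both retrieval and final scoring.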
