Existing machine learning models demonstrate excellent performance in image object recognition after training on a large-scale dataset under full supervision. However, these models only learn to map an image to a predefined class index, without revealing the actual semantic meaning of the object in the image. In contrast, vision-language models such as CLIP can assign semantic class names to unseen objects in a `zero-shot' manner, although they still rely on a predefined set of candidate names at test time. In this paper, we reconsider the recognition problem and task a vision-language model with assigning class names to images given only a large and essentially unconstrained vocabulary of categories as prior information. We use non-parametric methods to establish relationships between images that allow the model to automatically narrow down the set of candidate names. Specifically, we propose iteratively clustering the data and voting on class names within each cluster, showing that this yields a roughly 50\% improvement over the baseline on ImageNet. Furthermore, we tackle this problem in both unsupervised and partially supervised settings, and with both coarse-grained and fine-grained search spaces as the unconstrained dictionary.
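The cluster-then-vote idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the clustering algorithm, the voting rule, or how the candidate set is narrowed, so the code below assumes spherical k-means, a per-cluster majority vote, and a narrowing rule that keeps only the winning names for the next round. The function names (`cluster_and_vote`, `kmeans`, `nearest`) are hypothetical, and the image and name embeddings are assumed to come from a shared vision-language embedding space such as CLIP's.

```python
import math
import random
from collections import Counter


def normalize(v):
    """Scale a vector to unit length so dot products act as cosine similarity."""
    n = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / n for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def nearest(v, pool):
    """Index of the vector in `pool` most similar to `v` (all unit norm)."""
    return max(range(len(pool)), key=lambda i: dot(v, pool[i]))


def kmeans(points, k, iters=10, seed=0):
    """Plain spherical k-means (an assumed stand-in for the clustering step)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster is empty
                mean = [sum(c) / len(members) for c in zip(*members)]
                centers[j] = normalize(mean)
    return labels


def cluster_and_vote(image_emb, name_emb, k, rounds=3):
    """Return (name index per image, surviving candidate name indices)."""
    images = [normalize(v) for v in image_emb]
    names = [normalize(v) for v in name_emb]
    candidates = list(range(len(names)))  # start from the whole vocabulary
    assigned = []
    for r in range(rounds):
        clusters = kmeans(images, k, seed=r)
        pool = [names[c] for c in candidates]
        # Each image votes for its most similar remaining candidate name.
        votes = [candidates[nearest(v, pool)] for v in images]
        winner = {}
        for j in set(clusters):
            ballot = Counter(v for v, c in zip(votes, clusters) if c == j)
            winner[j] = ballot.most_common(1)[0][0]
        assigned = [winner[c] for c in clusters]
        # Assumed narrowing rule: only winning names survive to the next round.
        candidates = sorted(set(assigned))
    return assigned, candidates
```

Called as `cluster_and_vote(image_emb, name_emb, k=2)` with lists of embedding vectors, the sketch returns one vocabulary index per image together with the narrowed candidate set, which by construction contains at most `k` names.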