We argue that there are many notions of 'similarity' and that models, like humans, should be able to adapt to these dynamically. This contrasts with most representation learning methods, supervised or self-supervised, which learn a fixed embedding function and hence implicitly assume a single notion of similarity. For instance, models trained on ImageNet are biased towards object categories, while a user might prefer the model to focus on colors, textures or specific elements in the scene. In this paper, we propose the GeneCIS ('genesis') benchmark, which measures models' ability to adapt to a range of similarity conditions. Extending prior work, our benchmark is designed for zero-shot evaluation only, and hence considers an open set of similarity conditions. We find that baselines from powerful CLIP models struggle on GeneCIS and that performance on the benchmark is only weakly correlated with ImageNet accuracy, suggesting that simply scaling existing methods is not fruitful. We further propose a simple, scalable solution based on automatically mining information from existing image-caption datasets. We find our method offers a substantial boost over the baselines on GeneCIS, and further improves zero-shot performance on related image retrieval benchmarks. In fact, though evaluated zero-shot, our model surpasses state-of-the-art supervised models on MIT-States. Project page at https://sgvaze.github.io/genecis/.