Large-scale vision and language models can achieve impressive zero-shot recognition performance by mapping class-specific text queries to image content. Two distinct challenges remain, however: high sensitivity to the choice of handcrafted class names that define the queries, and the difficulty of adapting to new, smaller datasets. To address these problems, we propose to leverage available data to learn, for each class, an optimal word embedding as a function of the visual content. By learning new word embeddings on an otherwise frozen model, we retain zero-shot capabilities for new classes, easily adapt models to new datasets, and adjust potentially erroneous, non-descriptive, or ambiguous class names. We show that our solution can easily be integrated into image classification and object detection pipelines, yields significant performance gains in multiple scenarios, and provides insights into model biases and labelling errors.
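The core idea, learning one word embedding per class while the rest of the model stays frozen, can be illustrated with a toy sketch. This is not the implementation described here: the frozen text encoder is replaced by a fixed linear map (`FROZEN_W`), the image features are hand-picked 2-D vectors, similarity is a plain dot product, and all names (`encode_text`, `learn_class_embeddings`, and so on) are hypothetical. The initial embeddings are chosen to act like misleading class names, so the example also shows how learning can correct them.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def encode_text(frozen_w, word_emb):
    # Assumption: a fixed linear map stands in for the frozen text encoder.
    return [dot(row, word_emb) for row in frozen_w]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(frozen_w, class_embs, image_feat):
    # Score each class by the dot product of its text feature and the image feature.
    logits = [dot(encode_text(frozen_w, e), image_feat) for e in class_embs]
    return max(range(len(logits)), key=logits.__getitem__)

def accuracy(frozen_w, class_embs, images, labels):
    hits = sum(predict(frozen_w, class_embs, x) == y for x, y in zip(images, labels))
    return hits / len(labels)

def learn_class_embeddings(frozen_w, init_embs, images, labels, lr=0.5, epochs=50):
    # Only the per-class word embeddings are updated; frozen_w is never modified.
    embs = [list(e) for e in init_embs]
    emb_dim = len(embs[0])
    for _ in range(epochs):
        for x, y in zip(images, labels):
            logits = [dot(encode_text(frozen_w, e), x) for e in embs]
            probs = softmax(logits)
            # Gradient of a class logit w.r.t. its word embedding is W^T x.
            wt_x = [sum(frozen_w[i][j] * x[i] for i in range(len(x)))
                    for j in range(emb_dim)]
            for c in range(len(embs)):
                # Cross-entropy gradient w.r.t. logit c.
                g = probs[c] - (1.0 if c == y else 0.0)
                embs[c] = [e - lr * g * w for e, w in zip(embs[c], wt_x)]
    return embs

# Toy data (all values are illustrative assumptions).
FROZEN_W = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
IMAGES = [[1.0, 0.1], [0.9, 0.0], [1.0, 0.2], [0.1, 1.0], [0.0, 0.9], [0.2, 1.0]]
LABELS = [0, 0, 0, 1, 1, 1]
INIT_EMBS = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]  # deliberately misleading "class names"

acc_before = accuracy(FROZEN_W, INIT_EMBS, IMAGES, LABELS)
learned = learn_class_embeddings(FROZEN_W, INIT_EMBS, IMAGES, LABELS)
acc_after = accuracy(FROZEN_W, learned, IMAGES, LABELS)
print(acc_before, acc_after)
```

Because the encoder itself is untouched, an embedding for a class that was never trained can still be passed through `encode_text` unchanged, which is the property that lets zero-shot queries for new classes keep working in this toy setting.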