Contrastive Language-Image Pre-training (CLIP) has emerged as a simple yet effective way to train large-scale vision-language models. CLIP demonstrates impressive zero-shot classification and retrieval on diverse downstream tasks. However, to leverage its full potential, fine-tuning still appears to be necessary. Fine-tuning the entire CLIP model can be resource-intensive and unstable. Moreover, recent methods that aim to circumvent this need for fine-tuning still require access to images from the target distribution. In this paper, we pursue a different approach and explore the regime of training-free "name-only transfer", in which the only knowledge we possess about the downstream task is the names of the target categories. We propose a novel method, SuS-X, consisting of two key building blocks, SuS and TIP-X, that requires neither intensive fine-tuning nor costly labelled data. SuS-X achieves state-of-the-art zero-shot classification results on 19 benchmark datasets. We further show the utility of TIP-X in the training-free few-shot setting, where we again achieve state-of-the-art results over strong training-free baselines. Code is available at https://github.com/vishaal27/SuS-X.