With the growing interest in pretrained vision-language models such as CLIP, recent research has focused on adapting these models to downstream tasks. Despite promising results, most existing methods require labeled data for every class, a requirement that often fails in real-world applications because class frequencies are long-tailed and follow Zipf's law. Some classes, such as emerging concepts, may lack labeled data entirely. To address this problem, we propose a plug-and-play generative approach called \textbf{S}ynt\textbf{H}es\textbf{I}zed \textbf{P}rompts~(\textbf{SHIP}) to improve existing fine-tuning methods. Specifically, following variational autoencoders, we introduce a generator that reconstructs visual features by feeding synthesized prompts and the corresponding class names into the text encoder of CLIP. In this manner, we can easily obtain synthesized features for the remaining label-only classes, that is, classes for which only the class name is available. We then fine-tune CLIP with off-the-shelf methods on the combination of labeled and synthesized features. Extensive experiments on base-to-new generalization, cross-dataset transfer learning, and generalized zero-shot learning demonstrate the superiority of our approach. The code is available at \url{https://github.com/mrflogs/SHIP}.
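The data flow described above (sample a latent code, turn it into a prompt, encode the prompt together with a class name, and merge the resulting features with the labeled ones) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: `toy_text_encoder` and `class_embedding` are hypothetical stand-ins for CLIP's frozen text encoder and its class-name embedding, the generator is reduced to a fixed per-dimension scaling with untrained `weights`, and the VAE training on labeled classes is omitted entirely.

```python
import math
import random

DIM = 8  # toy feature dimension; real CLIP features are much larger


def normalize(v):
    """Scale a vector to unit L2 norm (CLIP-style features are usually normalized)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def class_embedding(name, dim=DIM):
    """Hypothetical class-name embedding: a fixed pseudo-random vector per name."""
    rng = random.Random(name)  # string seed, so the vector is deterministic per name
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def toy_text_encoder(prompt, name_vec):
    """Hypothetical frozen text encoder: mixes prompt and class name, then normalizes."""
    return normalize([math.tanh(p + c) for p, c in zip(prompt, name_vec)])


def synthesize_features(class_name, weights, n_samples, rng):
    """Generate features for a class that has a name but no labeled data."""
    name_vec = class_embedding(class_name)
    feats = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(DIM)]    # latent code from the N(0, I) prior
        prompt = [w * zi for w, zi in zip(weights, z)]   # toy generator: latent -> prompt
        feats.append(toy_text_encoder(prompt, name_vec))
    return feats


def build_training_set(labeled, new_classes, weights, n_per_class, seed=0):
    """Combine real labeled features with synthesized ones for label-only classes."""
    rng = random.Random(seed)
    data = [(f, c) for c, fs in labeled.items() for f in fs]
    for c in new_classes:
        data += [(f, c) for f in synthesize_features(c, weights, n_per_class, rng)]
    return data


# Illustrative inputs only: two labeled classes and one label-only class.
labeled = {"cat": [normalize([1.0] * DIM)], "dog": [normalize([0.5] * DIM)]}
weights = [0.1] * DIM  # assumption: would be learned on the labeled classes
train = build_training_set(labeled, ["axolotl"], weights, n_per_class=4)
print(len(train), sorted({label for _, label in train}))
```

The combined list `train` is what an off-the-shelf fine-tuning method would then consume; in this sketch the label-only class contributes only synthesized features, while the labeled classes contribute only real ones.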