Vision-language models have recently shown great potential on many computer vision tasks. Meanwhile, prior work demonstrates that prompt tuning designed for vision-language models can achieve superior performance on few-shot image recognition compared to linear probing, a strong baseline. In practice, many few-shot tasks are inherently correlated, particularly within specialized domains, yet this information has previously been overlooked. Inspired by the fact that modeling task relationships via multi-task learning usually boosts performance, we propose SoftCPT (Soft Context Sharing for Prompt Tuning), a novel method that tunes pre-trained vision-language models on multiple target few-shot tasks jointly. Specifically, we design a task-shared meta network that generates a prompt context for each task, taking the task name together with a learnable task context as input. The parameters of this meta network, as well as the task context, are tuned on the joint training set of all tasks. As a result, the prompt contexts of all tasks are shared in a soft manner. Extensive experiments across four multi-task few-shot datasets covering 44 tasks and 1593 categories demonstrate that SoftCPT significantly outperforms single-task prompt tuning methods, highlighting the effectiveness of multi-task learning for vision-language prompt tuning. Code is available at https://github.com/kding1225/softcpt.
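The soft-sharing mechanism described above can be illustrated with a minimal sketch: a task-shared network maps each task name, concatenated with a learnable task context, to that task's prompt context, so all tasks are coupled through the shared weights. This is a NumPy toy, not the authors' implementation; the dimensions, the two-layer MLP architecture, and the `embed_task_name` helper are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (illustrative only, not taken from the paper).
EMBED_DIM = 8    # dimension of the task-name embedding
CTX_DIM = 8      # dimension of the learnable task context
PROMPT_LEN = 4   # number of prompt context tokens generated per task
TOKEN_DIM = 8    # dimension of each prompt token

# Learnable task context, fed to the meta network (tuned jointly in SoftCPT).
task_context = rng.normal(size=CTX_DIM)

# Task-shared meta network: here a tiny two-layer MLP (an assumption; the
# real architecture may differ). Because its weights are shared across all
# tasks, the generated prompt contexts are coupled, i.e. shared "softly".
W1 = rng.normal(size=(EMBED_DIM + CTX_DIM, 32)) * 0.1
W2 = rng.normal(size=(32, PROMPT_LEN * TOKEN_DIM)) * 0.1

def embed_task_name(name: str) -> np.ndarray:
    """Hypothetical stand-in for a text encoder: hash characters into a vector."""
    v = np.zeros(EMBED_DIM)
    for i, ch in enumerate(name):
        v[i % EMBED_DIM] += ord(ch) / 100.0
    return v / max(len(name), 1)

def prompt_context(task_name: str) -> np.ndarray:
    """Generate a per-task prompt context from the task name + shared context."""
    x = np.concatenate([embed_task_name(task_name), task_context])
    h = np.maximum(x @ W1, 0.0)  # ReLU hidden layer
    return (h @ W2).reshape(PROMPT_LEN, TOKEN_DIM)

# Different tasks get different prompt contexts from the same shared weights.
ctx_a = prompt_context("flower classification")
ctx_b = prompt_context("pet classification")
print(ctx_a.shape)  # (4, 8)
```

In training, the gradients from every task's few-shot loss would flow into the same `W1`, `W2`, and `task_context`, which is what distinguishes this soft sharing from tuning an independent prompt per task.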