CoPL: Contextual Prompt Learning for Vision-Language Understanding

Recent advances in multimodal learning have resulted in powerful vision-language models whose representations generalize across a variety of downstream tasks. Their generalization ability has recently been extended further by incorporating trainable prompts, an idea borrowed from the natural language processing literature. While such prompt learning techniques have shown impressive results, we identify that these prompts are trained on global image features, which limits them in two respects. First, by relying on global features, the prompts may focus less on the discriminative foreground of the image, resulting in poor generalization to various out-of-distribution test cases. Second, existing work weights all prompts equally, whereas intuitively, prompts should be reweighted according to the semantics of the image. We address both limitations with our proposed Contextual Prompt Learning (CoPL) framework, which is capable of aligning the prompts to the localized features of the image. Our key innovations over earlier works include using local image features as part of the prompt learning process and, more crucially, learning to weight these prompts based on local features that are appropriate for the task at hand. This gives us dynamic prompts that are both aligned to local image features and aware of local contextual relationships. Extensive experiments on a variety of standard and few-shot datasets show that our method substantially outperforms current state-of-the-art methods. We also demonstrate both few-shot and out-of-distribution performance to establish the utility of learning dynamic prompts that are aligned to local image features.

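
The abstract describes prompts that are aligned to local image features and weighted by them, but gives no formulas. The sketch below is one plausible reading of that idea, not the CoPL method itself: each learnable prompt vector attends over patch-level (local) features, and the attention-weighted summary is added back to the prompt. The function name contextual_prompts, the projection matrix w_proj, the scaled dot-product scoring, and all shapes are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def contextual_prompts(prompts, local_feats, w_proj):
    """Condition prompt vectors on local image features (illustrative only).

    prompts:     (M, D) learnable prompt vectors (hypothetical shape)
    local_feats: (P, D) patch-level image features (hypothetical shape)
    w_proj:      (D, D) projection into the prompt space (assumed, not from the paper)

    Returns the conditioned prompts (M, D) and the attention weights (M, P).
    """
    projected = local_feats @ w_proj                      # (P, D)
    # Scaled dot-product scores between every prompt and every patch.
    scores = prompts @ projected.T / np.sqrt(prompts.shape[1])  # (M, P)
    # Each prompt gets its own distribution over patches.
    weights = softmax(scores, axis=1)                     # rows sum to 1
    # Summarize the patches each prompt attends to.
    context = weights @ projected                         # (M, D)
    # Residual update: the prompt becomes image-dependent ("dynamic").
    return prompts + context, weights


# Toy usage with random data: 4 prompts, 49 patches (7x7 grid), 8-dim features.
rng = np.random.default_rng(0)
toy_prompts = rng.normal(size=(4, 8))
toy_patches = rng.normal(size=(49, 8))
toy_proj = np.eye(8)
dynamic_prompts, attn = contextual_prompts(toy_prompts, toy_patches, toy_proj)
print(dynamic_prompts.shape, attn.shape)
```

In this reading, a prompt that scores highly against foreground patches draws most of its context from them, which is the intuition behind preferring local over global features. In a trained system the prompts and the projection would be learned jointly; here they are random placeholders.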