Contextual Prompt Learning for Vision-Language Understanding

Recent advances in multimodal learning have resulted in powerful vision-language models whose representations generalize across a variety of downstream tasks. Their generalizability has recently been extended further by incorporating trainable prompts, a technique borrowed from the natural language processing literature. While such prompt learning techniques have shown impressive results, we identify that these prompts are trained on global image features, which limits them in two respects. First, by relying on global features, the prompts may focus less on the discriminative foreground of the image, resulting in poor generalization to various out-of-distribution test cases. Second, existing work weights all prompts equally, whereas our intuition is that these prompts are specific to the type of image. We address these issues with our proposed Contextual Prompt Learning (CoPL) framework, which is capable of aligning the prompts to the localized features of the image. Our key innovations over earlier work are using local image features as part of the prompt learning process and, more crucially, learning to weight the prompts based on the local features that are appropriate for the task at hand. This gives us dynamic prompts that are both aligned to local image features and aware of local contextual relationships. Our extensive experiments on a variety of standard and few-shot datasets show that our method produces substantially improved performance compared to current state-of-the-art methods. We also report both few-shot and out-of-distribution performance to establish the utility of learning dynamic prompts that are aligned to local image features.
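
To make the idea of weighting prompts by local image features concrete, the following is a minimal illustrative sketch and not the CoPL implementation. The abstract does not specify the weighting mechanism, so dot-product similarity followed by a softmax, and averaging over patches, are assumptions made here; the names `contextual_prompt`, `local_feats`, and `prompts` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contextual_prompt(local_feats, prompts):
    """Combine prompt vectors using weights derived from local features.

    local_feats: list of patch feature vectors (one per image region).
    prompts:     list of prompt vectors with the same dimension.
    Returns (dynamic_prompt, mean_weights).

    Assumption: each patch scores every prompt by dot product, a softmax
    turns the scores into weights, and results are averaged over patches.
    """
    dim = len(prompts[0])
    n_patches = len(local_feats)
    dynamic = [0.0] * dim
    mean_weights = [0.0] * len(prompts)
    for feat in local_feats:
        weights = softmax([dot(feat, p) for p in prompts])
        for k, (w, p) in enumerate(zip(weights, prompts)):
            mean_weights[k] += w / n_patches
            for j in range(dim):
                dynamic[j] += w * p[j] / n_patches
    return dynamic, mean_weights


if __name__ == "__main__":
    # Toy data: two patches that resemble the first prompt more closely.
    patches = [[1.0, 0.0], [0.9, 0.1]]
    prompt_vectors = [[2.0, 0.0], [0.0, 2.0]]
    dyn, w = contextual_prompt(patches, prompt_vectors)
    print("prompt weights:", w)
    print("dynamic prompt:", dyn)
```

In this toy setting the prompts are no longer weighted equally: the prompt that better matches the local patches receives the larger weight, which is the behavior the abstract motivates. In a trained model the prompt vectors and any projection layers would be learned, which this sketch omits.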
