Current pre-trained vision-language models, such as CLIP, have demonstrated remarkable zero-shot generalization across various downstream tasks. However, their performance degrades significantly when the distribution of test inputs differs from that of the training data. In this paper, we explore test-time prompt tuning (TTPT), which adapts the CLIP model to novel downstream tasks through a one-step unsupervised optimization that involves only test samples. Inspired by in-context learning in natural language processing (NLP), we propose In-Context Prompt Learning (InCPL) for test-time visual recognition, which equips a pre-trained vision-language model with labeled examples as context information for the downstream task. Specifically, InCPL associates a new test sample with very few labeled examples (sometimes just one) as context information, enabling reliable label estimation for the test sample and facilitating model adaptation. To achieve this, InCPL employs an efficient language-to-vision translator to exploit textual prior information for visual prompt learning. Furthermore, we introduce a context-aware unsupervised loss to optimize visual prompts tailored to test samples. Finally, we design a cyclic learning strategy for visual and textual prompts to ensure mutual synergy across modalities. This enables a pre-trained, frozen CLIP model to adapt to any task using its learned adaptive prompts. Our method demonstrates superior performance and achieves state-of-the-art results across various downstream datasets.
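To make the test-time prompt tuning idea concrete, the toy sketch below illustrates the general TTPT recipe the abstract builds on: a frozen model, a single learnable prompt parameter, and one unsupervised gradient step on a test sample (here, entropy minimization over augmented views of the input). All names, dimensions, and the tiny random "encoder" are illustrative assumptions, not the paper's actual architecture or loss.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for a frozen CLIP-like model (illustrative only):
# fixed class-text embeddings, a fixed linear "image encoder", and a
# learnable visual prompt added to the input. None of this is the
# paper's real API; it only mirrors the TTPT setup described above.
torch.manual_seed(0)
num_classes, dim = 5, 16
text_feats = F.normalize(torch.randn(num_classes, dim), dim=-1)  # frozen text side
W = torch.randn(dim, dim)                                        # frozen encoder weights

def predict(x, prompt):
    # Add the prompt to the input, encode, normalize, and score
    # against the frozen text features (cosine-similarity logits).
    img_feats = F.normalize((x + prompt) @ W, dim=-1)
    return 10.0 * img_feats @ text_feats.t()

# One unlabeled test sample plus a few augmented "views" (noisy copies).
test_x = torch.randn(dim)
views = test_x + 0.05 * torch.randn(8, dim)

prompt = torch.zeros(dim, requires_grad=True)   # the only trainable parameter
opt = torch.optim.SGD([prompt], lr=1e-4)

def mean_entropy(p):
    # Entropy of the prediction averaged over augmented views.
    probs = F.softmax(predict(views, p), dim=-1).mean(0)
    return -(probs * probs.log()).sum()

# One-step unsupervised adaptation: minimize prediction entropy,
# keeping the "model" (W, text_feats) frozen.
entropy_before = mean_entropy(prompt)
opt.zero_grad()
entropy_before.backward()
opt.step()
entropy_after = mean_entropy(prompt.detach())
print(float(entropy_before), float(entropy_after))
```

The key property of the recipe is visible here: only the prompt receives gradients, so adaptation is cheap and the pre-trained weights stay untouched. InCPL extends this basic loop with labeled in-context examples and a context-aware loss rather than plain entropy minimization.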