With the advent of vision-language models (VLMs) that support in-context and prompt-based learning, how can we design prompting approaches that generalize robustly under distribution shift and remain usable on novel classes outside the support set of the prompts? In this work, we first define two types of robustness to distribution shift for VLMs: robustness on base classes (the classes included in the support set of the prompts) and robustness on novel classes. We then study the robustness of existing in-context learning and prompt learning approaches and find that prompt learning performs robustly on test images from base classes but does not generalize well to images from novel classes. To address this, we propose robust prompt learning, which integrates multi-scale image features into the prompt and improves both types of robustness. Comprehensive experiments on six benchmarks study the defined robustness and show the effectiveness of our proposal.