The promising zero-shot generalization of vision-language models such as CLIP has led to their adoption, via prompt learning, for numerous downstream tasks. Previous works have shown that test-time prompt tuning with entropy minimization can adapt text prompts to unseen domains. While effective, this overlooks the key cause of performance degradation on unseen domains: distribution shift. In this work, we explicitly handle this problem by using prompt tuning to align the statistics of out-of-distribution (OOD) test samples with those of the source data. We use a single test sample to adapt multi-modal prompts at test time, minimizing the feature distribution shift to bridge the gap to the test domain. Evaluated on the domain generalization benchmark, our method improves zero-shot top-1 accuracy beyond existing prompt-learning techniques, with a 3.08% improvement over the baseline MaPLe. In cross-dataset generalization with unseen categories across 10 datasets, our method improves consistently across all datasets compared to the existing state-of-the-art. Our source code and models are available at https://jameelhassan.github.io/promptalign.
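To make the alignment idea concrete, the following is a minimal sketch of a test-time objective that combines entropy minimization with a penalty on the gap between test and source feature statistics. It is not the released implementation. The abstract does not specify which statistics or which distance are used, so the choice of per-dimension mean and variance, the L1 distance, the weight `beta`, and all function names here are illustrative assumptions. The feature vectors are assumed to be derived from the single test sample (for example, from its tokens or augmented views).

```python
import math


def feature_stats(features):
    """Per-dimension mean and population variance over a list of feature vectors."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n for d in range(dim)]
    return mean, var


def alignment_loss(test_features, source_mean, source_var):
    """Average absolute gap between test and source statistics.

    Assumption: mean/variance statistics compared with an L1 distance.
    """
    mean, var = feature_stats(test_features)
    dim = len(mean)
    mean_gap = sum(abs(m - s) for m, s in zip(mean, source_mean)) / dim
    var_gap = sum(abs(v - s) for v, s in zip(var, source_var)) / dim
    return mean_gap + var_gap


def entropy(probs):
    """Shannon entropy (natural log) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def total_loss(probs, test_features, source_mean, source_var, beta=1.0):
    """Hypothetical combined objective: entropy plus beta-weighted alignment."""
    return entropy(probs) + beta * alignment_loss(test_features, source_mean, source_var)


if __name__ == "__main__":
    # Toy values: two 2-D feature vectors from one test sample.
    feats = [[1.0, 2.0], [3.0, 2.0]]
    print(total_loss([0.5, 0.5], feats, [2.0, 1.0], [1.0, 0.0]))
```

In an actual prompt-tuning setup, a quantity like `total_loss` would be minimized with respect to the learnable prompt parameters by backpropagating through the model; the sketch only evaluates the objective for fixed features.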