Advances in vision-language models (VLMs) have propelled the field of computer vision, particularly in the zero-shot learning setting. Despite their promise, the effectiveness of these models often diminishes under domain shifts in test environments. To address this, we introduce the Test-Time Prototype Shifting (TPS) framework, an approach designed to adapt VLMs to test datasets using unlabeled test inputs. Our method modulates per-class prototypes in the shared embedding space. By pre-computing and caching prototypes generated with the pre-trained text encoder, TPS not only allows prototypes to be reused for subsequent predictions without further optimization but also integrates seamlessly with current advances in prompt engineering. At test time, TPS dynamically learns a shift vector for each prototype based solely on the given test sample, bridging the domain gap and improving classification accuracy. A notable aspect of our framework is its significantly lower memory and computational cost compared to conventional text-prompt tuning methods. Extensive evaluations across 15 datasets involving natural distribution shifts and cross-dataset generalization show that TPS achieves state-of-the-art results while reducing resource requirements.
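The test-time procedure described in the abstract (cache per-class prototypes once, then learn one shift vector per prototype from a single unlabeled test sample) can be illustrated with a minimal sketch. Several pieces are assumptions that the abstract does not specify: the objective (entropy of the prediction is used here as a stand-in), the optimizer (one plain gradient step), the logit scale, and all function names. Toy 3-dimensional vectors replace real encoder outputs, and gradients are taken by central finite differences in place of automatic differentiation, so this is not the authors' implementation.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy of a probability vector (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def predict(image_feat, prototypes, shifts, scale=10.0):
    """Class probabilities from cosine similarity to the shifted prototypes.

    `scale` is a hypothetical logit scale chosen for this toy example.
    """
    img = normalize(image_feat)
    logits = []
    for proto, shift in zip(prototypes, shifts):
        shifted = normalize([p + s for p, s in zip(proto, shift)])
        logits.append(scale * sum(a * b for a, b in zip(img, shifted)))
    return softmax(logits)


def learn_shifts(image_feat, prototypes, steps=1, lr=0.1, eps=1e-5):
    """Learn one shift vector per class for a single unlabeled test sample.

    Assumption: the loss is prediction entropy. The cached prototypes are
    never modified; only the shift vectors are updated.
    """
    shifts = [[0.0] * len(p) for p in prototypes]
    for _ in range(steps):
        grads = [[0.0] * len(p) for p in prototypes]
        # Central finite differences, standing in for autograd.
        for c in range(len(shifts)):
            for d in range(len(shifts[c])):
                base = shifts[c][d]
                shifts[c][d] = base + eps
                up = entropy(predict(image_feat, prototypes, shifts))
                shifts[c][d] = base - eps
                down = entropy(predict(image_feat, prototypes, shifts))
                shifts[c][d] = base
                grads[c][d] = (up - down) / (2 * eps)
        # Plain gradient-descent update on the shift vectors.
        for c in range(len(shifts)):
            for d in range(len(shifts[c])):
                shifts[c][d] -= lr * grads[c][d]
    return shifts


# Toy stand-ins for cached text-encoder prototypes and one image embedding.
cached_prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
test_feature = [0.6, 0.5, 0.2]

zero_shifts = [[0.0] * 3 for _ in cached_prototypes]
probs_before = predict(test_feature, cached_prototypes, zero_shifts)
learned_shifts = learn_shifts(test_feature, cached_prototypes)
probs_after = predict(test_feature, cached_prototypes, learned_shifts)
```

Because only the shift vectors are optimized, the text encoder does not need to be run again for each test sample in this sketch; the cached prototypes are read but never changed, and the shifts are re-initialized to zero for every new sample.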