Textual prompt tuning has yielded significant performance gains in adapting natural language processing models to a variety of downstream tasks by treating hand-engineered prompts as trainable parameters. Inspired by the success of textual prompting, several studies have investigated the efficacy of visual prompt tuning. In this work, we present Visual Prompt Adaptation (VPA), the first framework that generalizes visual prompting with test-time adaptation. VPA introduces a small number of learnable tokens, enabling fully test-time and storage-efficient adaptation without requiring source-domain information. We examine our VPA design under diverse adaptation settings, encompassing single-image, batched-image, and pseudo-label adaptation. We evaluate VPA on multiple tasks, including out-of-distribution (OOD) generalization, corruption robustness, and domain adaptation. Experimental results show that VPA improves OOD generalization by 3.3% across various models, surpassing previous test-time approaches. Furthermore, we show that VPA improves corruption robustness by 6.5% compared to strong baselines. Finally, we demonstrate that VPA boosts domain adaptation performance by a relative 5.2%. VPA is also markedly effective in improving the robustness of zero-shot recognition for vision-language models.
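To make the idea of adapting only a few learnable tokens at test time concrete, the following is a minimal sketch and not the VPA implementation. The abstract does not state the adaptation objective or the architecture, so the sketch assumes entropy minimization (a common test-time adaptation objective) and substitutes a toy frozen model (mean pooling followed by a linear head) for a vision transformer. All names (`forward`, `adapt_prompts`, `head`) are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forward(tokens, prompts, head):
    """Toy frozen model (assumption): mean-pool image and prompt tokens, then a linear head."""
    rows = tokens + prompts
    dim, n_classes = len(head), len(head[0])
    pooled = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
    logits = [sum(pooled[d] * head[d][c] for d in range(dim)) for c in range(n_classes)]
    return softmax(logits)


def adapt_prompts(tokens, prompts, head, steps=100, lr=0.5):
    """Minimize prediction entropy on one unlabeled test input.

    Only the prompt tokens are updated; image tokens and the head stay frozen.
    """
    prompts = [row[:] for row in prompts]  # do not mutate the caller's prompts
    dim = len(head)
    scale = 1.0 / (len(tokens) + len(prompts))  # derivative of mean pooling
    for _ in range(steps):
        probs = forward(tokens, prompts, head)
        h = entropy(probs)
        # For softmax outputs, dH/dz_c = -p_c * (log p_c + H).
        grad_logits = [-p * (math.log(max(p, 1e-12)) + h) for p in probs]
        # Backpropagate through the linear head and the mean pooling.
        grad_prompt = [
            scale * sum(head[d][c] * grad_logits[c] for c in range(len(probs)))
            for d in range(dim)
        ]
        for row in prompts:
            for d in range(dim):
                row[d] -= lr * grad_prompt[d]
    return prompts


# Usage on synthetic data: 6 image tokens, 2 prompt tokens, 4 dims, 3 classes.
random.seed(0)
head = [[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]
tokens = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]
prompts = [[0.0] * 4 for _ in range(2)]

before = entropy(forward(tokens, prompts, head))
adapted = adapt_prompts(tokens, prompts, head)
after = entropy(forward(tokens, adapted, head))
print(f"entropy before: {before:.4f}, after: {after:.4f}")
```

Only the prompt tokens need to be stored per target domain here, which is the sense in which such adaptation is storage-efficient; no labels or source-domain data enter the update.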