Visual Prompt Tuning (VPT) has recently emerged as a powerful method for adapting pre-trained vision models to downstream tasks. By introducing learnable prompt tokens as task-specific instructions, VPT effectively guides pre-trained transformer models with minimal overhead. Despite its empirical success, a comprehensive theoretical understanding of VPT remains an active area of research. Building on recent insights into the connection between mixture-of-experts models and prompt-based approaches, we identify a key limitation of VPT: the restricted functional expressiveness of its prompt formulation. To address this limitation, we propose Visual Adaptive Prompt Tuning (VAPT), a new generation of prompts redefined as adaptive functions of the input. Our theoretical analysis shows that this simple yet intuitive approach achieves optimal sample efficiency. Empirical results on VTAB-1K and FGVC further demonstrate VAPT's effectiveness, with performance gains of 7.34% and 1.04% over full fine-tuning baselines, respectively. Notably, VAPT also surpasses VPT by a substantial margin while using fewer parameters. These results highlight both the effectiveness and efficiency of our method and pave the way for future research to explore the potential of adaptive prompts.
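To make the core distinction concrete, the sketch below contrasts VPT-style static prompt tokens with input-adaptive prompts of the kind the abstract describes. This is a minimal conceptual illustration under our own assumptions, not the paper's actual architecture: the mean-pooled linear prompt generator and all module names are hypothetical.

```python
import torch
import torch.nn as nn

class StaticPrompt(nn.Module):
    """VPT-style prompts: fixed learnable tokens shared across all inputs."""
    def __init__(self, num_prompts: int, dim: int):
        super().__init__()
        self.prompts = nn.Parameter(torch.zeros(num_prompts, dim))
        nn.init.normal_(self.prompts, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) patch embeddings.
        # The same prompt tokens are prepended regardless of the input.
        p = self.prompts.unsqueeze(0).expand(x.size(0), -1, -1)
        return torch.cat([p, x], dim=1)

class AdaptivePrompt(nn.Module):
    """Input-adaptive prompts (sketch): tokens computed as a function of the input.
    The pooled-feature linear generator here is an illustrative assumption."""
    def __init__(self, num_prompts: int, dim: int):
        super().__init__()
        self.num_prompts = num_prompts
        # Hypothetical lightweight map from a summary of the input to prompt tokens.
        self.generator = nn.Linear(dim, num_prompts * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, _, dim = x.shape
        pooled = x.mean(dim=1)  # (batch, dim) summary of the input sequence
        p = self.generator(pooled).view(batch, self.num_prompts, dim)
        return torch.cat([p, x], dim=1)

if __name__ == "__main__":
    x = torch.randn(4, 196, 768)             # a batch of ViT-style patch embeddings
    print(StaticPrompt(10, 768)(x).shape)     # torch.Size([4, 206, 768])
    print(AdaptivePrompt(10, 768)(x).shape)   # torch.Size([4, 206, 768])
```

In the static case the prompt tokens are the only trainable state and are identical for every image; in the adaptive case the prompts vary with each input, which is the added functional expressiveness the abstract attributes to VAPT.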