Visual Prompt Tuning (VPT) has recently emerged as a powerful method for adapting pre-trained vision models to downstream tasks. By introducing learnable prompt tokens as task-specific instructions, VPT effectively guides pre-trained transformer models with minimal overhead. Despite its empirical success, a comprehensive theoretical understanding of VPT remains an active area of research. Building on recent insights into the connection between mixture-of-experts models and prompt-based approaches, we identify a key limitation of VPT: the restricted functional expressiveness of its prompt formulation. To address this limitation, we propose Visual Adaptive Prompt Tuning (VAPT), a new generation of prompts that redefines prompts as adaptive functions of the input. Our theoretical analysis shows that this simple yet intuitive approach achieves optimal sample efficiency. Empirical results on VTAB-1K and FGVC further demonstrate VAPT's effectiveness, with performance gains of 7.34% and 1.04% over full fine-tuning baselines, respectively. Notably, VAPT also surpasses VPT by a substantial margin while using fewer parameters. These results highlight both the effectiveness and efficiency of our method and pave the way for future research on the potential of adaptive prompts. Our code is publicly available at https://github.com/Minhchuyentoancbn/VAPT
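To make the core idea concrete, below is a minimal sketch of the contrast between static prompts (as in VPT) and input-adaptive prompts. The class name `AdaptivePromptGenerator`, the mean-pooled feature conditioning, and the linear map producing prompt offsets are illustrative assumptions, not the authors' exact parameterization; see the repository linked above for the actual implementation.

```python
import torch
import torch.nn as nn

class AdaptivePromptGenerator(nn.Module):
    """Illustrative sketch of input-adaptive prompts (hypothetical design).

    VPT prepends a fixed set of learnable prompt tokens to the patch
    embeddings. Here the prompts are instead a function of the input:
    a shared base prompt plus an offset predicted from pooled patch features.
    """

    def __init__(self, embed_dim: int, num_prompts: int):
        super().__init__()
        # Shared (static) prompt component, as in standard VPT.
        self.base_prompts = nn.Parameter(torch.zeros(num_prompts, embed_dim))
        # Lightweight map from pooled input features to per-sample prompt
        # offsets (an assumed parameterization for illustration only).
        self.to_prompts = nn.Linear(embed_dim, num_prompts * embed_dim)

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (batch, num_patches, embed_dim)
        pooled = patch_embeddings.mean(dim=1)                      # (B, D)
        offsets = self.to_prompts(pooled)                          # (B, P*D)
        offsets = offsets.view(-1, *self.base_prompts.shape)       # (B, P, D)
        prompts = self.base_prompts.unsqueeze(0) + offsets         # (B, P, D)
        # Prepend the adaptive prompts to the token sequence fed to the
        # transformer block(s), as prompt tuning methods typically do.
        return torch.cat([prompts, patch_embeddings], dim=1)
```

In this sketch only `base_prompts` and `to_prompts` would be trained while the backbone stays frozen, which is what keeps the parameter count comparable to, or smaller than, that of plain VPT.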