Due to increasing interest in adapting models on resource-constrained edge devices, parameter-efficient transfer learning has been widely explored. Among various methods, Visual Prompt Tuning (VPT), which prepends learnable prompts to the input space, achieves fine-tuning performance competitive with training all network parameters. However, VPT increases the number of input tokens, resulting in additional computational overhead. In this paper, we analyze the impact of the number of prompts on fine-tuning performance and on the self-attention operation in a vision transformer architecture. Through theoretical and empirical analysis, we show that adding more prompts does not lead to a linear improvement in performance. Further, we propose a Prompt Condensation (PC) technique that aims to prevent the performance degradation caused by using a small number of prompts. We validate our method on FGVC and VTAB-1k tasks and show that our approach reduces the number of prompts by approximately 70% while maintaining accuracy.
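To make the overhead from extra input tokens concrete, the following is a minimal sketch of how the cost of one self-attention layer grows with the number of prepended prompts. It is not the paper's analysis: the cost formula is the standard approximation for multi-head self-attention, and the function names `attention_macs` and `relative_overhead`, as well as the default configuration (196 patch tokens plus one class token, embedding dimension 768, as in a ViT-B/16 at 224x224 resolution), are assumptions chosen for illustration.

```python
def attention_macs(num_tokens: int, dim: int) -> int:
    """Approximate multiply-accumulate count for one self-attention layer.

    Counts the Q, K, V, and output projections (4 * n * d^2) plus the
    attention-score and weighted-sum products (2 * n^2 * d).
    """
    projections = 4 * num_tokens * dim * dim
    attention = 2 * num_tokens * num_tokens * dim
    return projections + attention


def relative_overhead(num_prompts: int, num_patches: int = 196, dim: int = 768) -> float:
    """Fractional increase in attention cost caused by prepending prompts.

    The defaults are an illustrative assumption (ViT-B/16 at 224x224),
    not necessarily the configuration used in the paper.
    """
    base_tokens = num_patches + 1  # patch tokens plus the class token
    base = attention_macs(base_tokens, dim)
    with_prompts = attention_macs(base_tokens + num_prompts, dim)
    return with_prompts / base - 1.0


if __name__ == "__main__":
    # Hypothetical prompt counts, chosen only to show the trend.
    for n in (0, 10, 30, 100, 200):
        print(f"{n:>3} prompts: +{relative_overhead(n):.1%} attention cost")
```

Because the score and weighted-sum term scales quadratically with the token count, the overhead grows faster than the number of prompts, which is one reason reducing the prompt count can be worthwhile when accuracy is preserved.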