As vision models continue to grow in scale, Visual Prompt Tuning (VPT) has gained attention as a parameter-efficient transfer learning technique because of its superior performance compared to traditional full fine-tuning. However, the conditions that favor VPT (the ``when'') and the underlying rationale (the ``why'') remain unclear. In this paper, we conduct a comprehensive analysis across 19 distinct datasets and tasks. To understand the ``when'' aspect, we identify the scenarios where VPT proves favorable along two dimensions: task objectives and data distributions. We find that VPT is preferable when there is 1) a substantial disparity between the original and the downstream task objectives (e.g., transitioning from classification to counting), or 2) a similarity in data distributions between the two tasks (e.g., both involve natural images). In exploring the ``why'' dimension, our results indicate that VPT's success cannot be attributed solely to overfitting and optimization considerations. The unique way VPT preserves original features and adds parameters appears to be a pivotal factor. Our study provides insights into VPT's mechanisms and offers guidance for its optimal utilization.