Visual Prompt Tuning (VPT) adapts pre-trained Vision Transformers (ViTs) to downstream visual tasks through specialized learnable tokens called prompts. Existing VPT methods, especially when applied to self-supervised vision transformers, typically either introduce new learnable prompts or use gated prompt tokens drawn mainly from the model's immediately preceding block. These approaches overlook long-range earlier blocks as sources of prompts within a self-supervised ViT. To close this gap, we introduce Long-term Spatial Prompt Tuning (LSPT), a new approach to visual representation learning. Inspired by the human brain, LSPT incorporates long-term gated prompts that act as temporal coding and reduce the risk of forgetting parameters learned in earlier blocks. LSPT also uses patch tokens as spatial coding, which continually accumulates class-aware features and strengthens the model's ability to distinguish and identify visual categories. We evaluate the method on 5 FGVC and 19 VTAB-1K benchmarks. The results show that LSPT outperforms prior approaches and sets a new state of the art in visual prompt tuning performance.
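To make the two ideas concrete, here is a minimal pure-Python sketch, not the LSPT implementation. The abstract does not say how the gates are computed or how patch tokens are combined with prompts, so everything below is an assumption for illustration: gates are a sigmoid of a per-block scalar, prompts from all earlier blocks are summed with those gates (the temporal coding), and the spatial coding is the mean of the patch tokens added to the prompt. The names `long_term_prompt`, `add_spatial_code`, and `gate_logits` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mean_pool(tokens):
    """Average a list of equal-length vectors element-wise."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def long_term_prompt(new_prompt, history, gate_logits):
    """Mix a block's prompt with prompts from ALL earlier blocks.

    Assumed form (not taken from the paper):
        mixed = new_prompt + sum_k sigmoid(gate_logits[k]) * history[k]

    new_prompt:  prompt vector for the current block
    history:     prompt vectors produced by every earlier block
    gate_logits: one scalar per history entry (learned in a real model)
    """
    if len(history) != len(gate_logits):
        raise ValueError("need exactly one gate logit per earlier block")
    mixed = list(new_prompt)
    for past_prompt, logit in zip(history, gate_logits):
        gate = sigmoid(logit)
        mixed = [m + gate * p for m, p in zip(mixed, past_prompt)]
    return mixed


def add_spatial_code(prompt, patch_tokens):
    """Add mean-pooled patch tokens to the prompt (assumed spatial coding)."""
    pooled = mean_pool(patch_tokens)
    return [p + s for p, s in zip(prompt, pooled)]


# Toy usage with 2-dimensional tokens and two earlier blocks.
history = [[1.0, 0.0], [0.0, 1.0]]
prompt = long_term_prompt([0.0, 0.0], history, gate_logits=[0.0, 0.0])
prompt = add_spatial_code(prompt, patch_tokens=[[2.0, 2.0], [4.0, 0.0]])
```

In this toy run both gates equal 0.5, so each earlier block contributes half of its prompt. The point of the sketch is only that every earlier block, not just the immediately preceding one, can feed the current prompt, and that patch tokens supply an additional input-dependent signal.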