In this paper, we introduce Attention Prompt Tuning (APT), a computationally efficient variant of prompt tuning for video-based applications such as action recognition. Prompt tuning approaches inject a set of learnable prompts along with the data tokens during fine-tuning while keeping the backbone frozen. This approach greatly reduces the number of learnable parameters compared to full tuning. For image-based downstream tasks, a few learnable prompts typically achieve results close to those of full tuning. However, videos, which contain more complex spatiotemporal information, require hundreds of tunable prompts to achieve reasonably good results. This erodes the parameter efficiency observed on images and significantly increases latency and the number of floating-point operations (FLOPs) during inference. To tackle these issues, we directly inject the prompts into the keys and values of the non-local attention mechanism within the transformer block. Additionally, we introduce a novel prompt reparameterization technique to make APT more robust against hyperparameter selection. The proposed APT approach greatly reduces FLOPs and latency while achieving a significant performance boost over existing parameter-efficient tuning methods on the UCF101, HMDB51, and SSv2 datasets for action recognition. The code and pre-trained models are available at https://github.com/wgcban/apt
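To illustrate the key idea, the following is a minimal single-head sketch in NumPy of injecting prompts into the keys and values of attention, rather than prepending them to the input token sequence. The function name, shapes, and single-head simplification are illustrative assumptions, not the paper's actual (multi-head, frozen-ViT) implementation; the point it demonstrates is that the number of output tokens stays equal to the number of input tokens, so the prompt count does not inflate the sequence length processed by subsequent layers.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_kv_prompts(x, Wq, Wk, Wv, Pk, Pv):
    """Single-head attention with learnable prompts injected into keys/values.

    x        : (N, d) input tokens
    Wq/Wk/Wv : (d, d) frozen projection weights of the pretrained backbone
    Pk/Pv    : (m, d) learnable key/value prompts (the only tuned parameters)

    Returns (N, d): queries come only from the data tokens, so the output
    sequence length is N regardless of the prompt count m.
    """
    q = x @ Wq                                   # (N, d)
    k = np.concatenate([Pk, x @ Wk], axis=0)     # (m+N, d): prompts join the keys
    v = np.concatenate([Pv, x @ Wv], axis=0)     # (m+N, d): prompts join the values
    attn = softmax(q @ k.T / np.sqrt(x.shape[1]))  # (N, m+N)
    return attn @ v                              # (N, d)
```

In contrast, prepending m prompts to the input would make every layer attend over and emit m + N tokens, which is where the extra FLOPs and latency of standard prompt tuning on video come from.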