Text-video retrieval is a critical multi-modal task that aims to find the most relevant video for a text query. Although pretrained models like CLIP have demonstrated impressive potential in this area, the rising cost of fully finetuning these models as model size increases continues to pose a problem. To address this challenge, prompt tuning has emerged as an alternative. However, existing works still face two problems when adapting pretrained image-text models to downstream video-text tasks: (1) the visual encoder can only encode frame-level features and fails to extract global-level general video information; (2) equipping the visual and text encoders with separate prompts fails to mitigate the visual-text modality gap. To this end, we propose DGL, a cross-modal Dynamic prompt tuning method with Global-Local video attention. In contrast to previous prompt tuning methods, we employ a shared latent space to generate local-level text and frame prompts that encourage inter-modal interaction. Furthermore, we propose modeling video with a global-local attention mechanism to capture global video information from the perspective of prompt tuning. Extensive experiments reveal that when only 0.67% of the parameters are tuned, our cross-modal prompt tuning strategy DGL outperforms or is comparable to full finetuning methods on the MSR-VTT, VATEX, LSMDC, and ActivityNet datasets. Code will be available at https://github.com/knightyxp/DGL
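As a rough illustration of the shared-latent-space idea described in the abstract, the sketch below derives text prompts and frame prompts from the same latent vectors through two modality-specific projections, then prepends them to a token sequence. The linear projections, the dimensions (512 and 768), the random initialization, and the function names `make_cross_modal_prompts` and `prepend_prompts` are all assumptions made for illustration; this is not the DGL implementation, and it omits the global-local attention component entirely.

```python
import numpy as np


def make_cross_modal_prompts(num_prompts=4, latent_dim=32,
                             text_dim=512, visual_dim=768, seed=0):
    """Generate text and frame prompts from one shared set of latent vectors.

    All sizes and the use of plain linear maps are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Shared latent vectors: both modalities draw their prompts from these.
    shared_latent = rng.standard_normal((num_prompts, latent_dim))
    # Hypothetical modality-specific projections (trainable in a real system).
    w_text = rng.standard_normal((latent_dim, text_dim)) * 0.02
    w_frame = rng.standard_normal((latent_dim, visual_dim)) * 0.02
    text_prompts = shared_latent @ w_text     # (num_prompts, text_dim)
    frame_prompts = shared_latent @ w_frame   # (num_prompts, visual_dim)
    return text_prompts, frame_prompts


def prepend_prompts(tokens, prompts):
    """Prepend prompt vectors to a (seq_len, dim) token sequence."""
    return np.concatenate([prompts, tokens], axis=0)


text_prompts, frame_prompts = make_cross_modal_prompts()
word_tokens = np.zeros((10, 512))             # placeholder word embeddings
prompted_text = prepend_prompts(word_tokens, text_prompts)
```

Because both prompt sets are functions of the same latent vectors, a gradient step on those vectors changes the prompts of both encoders at once, which is one plausible way to couple the two modalities instead of tuning separate prompts for each.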