Prompt tuning with large-scale pretrained vision-language models enables open-vocabulary prediction from training on a limited set of base categories, e.g., in object classification and detection. In this paper, we propose compositional prompt tuning with motion cues, an extension of the prompt tuning paradigm to compositional predictions on video data. In particular, we present Relation Prompt (RePro) for Open-vocabulary Video Visual Relation Detection (Open-VidVRD), a task in which conventional prompt tuning is easily biased toward certain subject-object combinations and motion patterns. RePro addresses two technical challenges of Open-VidVRD: 1) the prompt tokens should respect the two different semantic roles of subject and object, and 2) the tuning should account for the diverse spatio-temporal motion patterns of subject-object compositions. Without bells and whistles, RePro achieves new state-of-the-art performance on two VidVRD benchmarks, not only on the base training object and predicate categories but also on unseen ones. Extensive ablations also demonstrate the effectiveness of the proposed compositional and multi-mode prompt design. Code is available at https://github.com/Dawn-LX/OpenVoc-VidVRD.
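To make the two design requirements concrete, the following is a minimal illustrative sketch and not the released RePro implementation. It assumes one set of learnable prompt tokens per semantic role (subject, object) and per motion mode, a hand-written rule that picks the mode from the change in subject-object distance, and a mean-pooling stand-in for the text encoder of a pretrained vision-language model. All names, sizes, the predicate list, and the mode-selection rule (`motion_mode`, `encode`, `NUM_MODES`, and so on) are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes and category names below are hypothetical.
EMBED_DIM = 16   # token embedding size
NUM_TOKENS = 4   # learnable prompt tokens per role
NUM_MODES = 3    # motion modes: 0 = approaching, 1 = steady, 2 = moving away
PREDICATES = ["chase", "sit_on", "move_away"]

# Role-specific prompt tokens, one group per motion mode (randomly initialised
# here; in a real system these would be the tuned parameters).
subj_prompts = rng.normal(size=(NUM_MODES, NUM_TOKENS, EMBED_DIM))
obj_prompts = rng.normal(size=(NUM_MODES, NUM_TOKENS, EMBED_DIM))
# Stand-in for the embeddings of the predicate category names.
class_tokens = rng.normal(size=(len(PREDICATES), EMBED_DIM))


def motion_mode(subj_centers, obj_centers, tol=1e-6):
    """Assumed rule: choose a mode from the change in subject-object distance."""
    diff = np.asarray(subj_centers, float) - np.asarray(obj_centers, float)
    dist = np.linalg.norm(diff, axis=1)
    change = dist[-1] - dist[0]
    if change < -tol:
        return 0  # approaching
    if change > tol:
        return 2  # moving away
    return 1      # roughly steady


def encode(tokens):
    """Stand-in text encoder: mean-pool the tokens and L2-normalise."""
    pooled = tokens.mean(axis=0)
    return pooled / np.linalg.norm(pooled)


def predicate_text_embeddings(mode):
    """Compose subject-role and object-role prompts for every predicate."""
    rows = []
    for cls in class_tokens:
        subj_seq = np.vstack([subj_prompts[mode], cls])
        obj_seq = np.vstack([obj_prompts[mode], cls])
        rows.append(np.concatenate([encode(subj_seq), encode(obj_seq)]))
    return np.stack(rows)  # shape: (num_predicates, 2 * EMBED_DIM)


def classify(relation_feature, mode):
    """Score a visual relation feature against the prompts by cosine similarity."""
    text = predicate_text_embeddings(mode)
    text = text / np.linalg.norm(text, axis=1, keepdims=True)
    feat = relation_feature / np.linalg.norm(relation_feature)
    scores = text @ feat
    return PREDICATES[int(np.argmax(scores))], scores
```

Under these assumptions, a subject-object pair is first assigned a mode from its box-center trajectories via `motion_mode`, and `classify` then compares its visual relation feature against prompts from that mode only. Because the predicate names enter only through `class_tokens`, an unseen predicate can be scored by adding its name embedding without new prompt parameters.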