The goal of spatio-temporal action detection is to determine when and where each person's action occurs in a video and to classify the corresponding action category. Most existing methods rely on fully supervised learning, which requires a large amount of training data and makes zero-shot learning very difficult to achieve. In this paper, we propose to use a pre-trained visual-language model to extract representative image and text features, and to model the relationship between these features through different interaction modules to obtain an interaction feature. In addition, we use this feature to prompt each label and obtain more appropriate text features. Finally, we compute the similarity between the interaction feature and the text feature of each label to determine the action category. Our experiments on the J-HMDB and UCF101-24 datasets demonstrate that the proposed interaction modules and prompting align the visual-language features better, thus achieving excellent accuracy for zero-shot spatio-temporal action detection. The code will be released upon acceptance.
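The final classification step described in the abstract, scoring an interaction feature against one text feature per label and picking the best match, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the similarity measure, so cosine similarity is assumed here, and the function names (`cosine_similarity`, `classify_action`), the three-dimensional toy vectors, and the label names are all hypothetical. The interaction modules and the label prompting are not reproduced; the features are taken as given.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumed preprocessing, not from the paper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("feature dimensions must match")
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def classify_action(interaction_feature, label_text_features):
    """Return the label whose text feature is most similar to the
    interaction feature, together with all per-label scores."""
    scores = {
        label: cosine_similarity(interaction_feature, text_feature)
        for label, text_feature in label_text_features.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy values for illustration only; real features would come from a
# pre-trained visual-language model and have far more dimensions.
label_text_features = {
    "run": [1.0, 0.0, 0.0],
    "jump": [0.0, 1.0, 0.0],
    "wave": [0.0, 0.0, 1.0],
}
interaction_feature = [0.2, 0.9, 0.1]

predicted, scores = classify_action(interaction_feature, label_text_features)
print(predicted, scores)
```

Because the labels enter only through their text features, unseen action categories can be scored by adding entries to `label_text_features` without retraining, which is the property that makes this kind of similarity-based classification suitable for the zero-shot setting.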