Spatio-temporal action detection encompasses localizing and classifying individual actions within a video. Recent works enhance this process by incorporating interaction modeling, which captures the relationships between people and their surrounding context. However, these approaches have focused primarily on fully supervised learning and therefore lack the ability to generalize to unseen action categories. In this paper, we adapt pretrained image-language models to detect unseen actions. To this end, we propose a method that effectively leverages the rich knowledge of visual-language models to perform Person-Context Interaction. Meanwhile, our Context Prompting module uses contextual information to prompt the labels, generating more representative text features. Moreover, to address the challenge of recognizing distinct actions performed by multiple people at the same timestamp, we design an Interest Token Spotting mechanism that employs pretrained visual knowledge to find each person's interest context tokens; these tokens are then used for prompting to generate text features tailored to each individual. To evaluate the ability to detect unseen actions, we propose a comprehensive benchmark on the J-HMDB, UCF101-24, and AVA datasets. Experiments show that our method achieves superior results compared to previous approaches and can be further extended to multi-action videos, bringing it closer to real-world applications. The code and data are available at https://webber2933.github.io/ST-CLIP-project-page.
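To make the per-person token selection concrete, below is a minimal sketch of the Interest Token Spotting idea: each detected person's visual embedding selects its most relevant context tokens by similarity, so that two people in the same frame can be prompted with different context. The function name, the cosine-similarity top-k rule, and the toy dimensions are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def spot_interest_tokens(person_feat, context_tokens, k=3):
    """Select the k context tokens most similar to one person's feature.

    person_feat:    (d,) embedding of a detected person (e.g. from a frozen
                    image encoder -- an assumption for this sketch)
    context_tokens: (n, d) context/patch token embeddings from the same clip
    Returns the (k, d) tokens that would condition this person's text prompt.
    """
    # cosine similarity between the person and every context token
    sims = F.cosine_similarity(person_feat.unsqueeze(0), context_tokens, dim=-1)
    topk = sims.topk(k).indices
    return context_tokens[topk]

# Toy example: two people in the same frame attend to different tokens.
torch.manual_seed(0)
tokens = torch.randn(16, 64)                    # 16 context tokens, dim 64
person_a = tokens[2] + 0.1 * torch.randn(64)    # person A resembles token 2
person_b = tokens[9] + 0.1 * torch.randn(64)    # person B resembles token 9

tokens_a = spot_interest_tokens(person_a, tokens)
tokens_b = spot_interest_tokens(person_b, tokens)
```

In a full pipeline, `tokens_a` and `tokens_b` would each be fed into the Context Prompting step, yielding per-person text features for the same set of action labels.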