We propose Temporally Contextualized CLIP (TC-CLIP), a novel framework for video understanding that leverages essential temporal information through global interactions in the spatio-temporal domain of a video. Specifically, we introduce Temporal Contextualization (TC), a layer-wise mechanism that infuses temporal information into video representations by 1) extracting core information from each frame, 2) connecting relevant information across frames and summarizing it into context tokens, and 3) leveraging the context tokens during feature encoding. Furthermore, the Video-conditional Prompting (VP) module processes context tokens to generate informative prompts in the text modality. Extensive experiments on zero-shot, few-shot, base-to-novel, and fully-supervised action recognition validate the effectiveness of our model. Ablation studies of TC and VP support our design choices. Our project page with the source code is available at https://github.com/naver-ai/tc-clip
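The three TC steps can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation: the saliency score (token norm), the summarizer (a few k-means steps), and the context-attention form are all simplifying assumptions made here for illustration.

```python
import numpy as np

def temporal_contextualization(frames, k=2, num_context=4, seed=0):
    """Hypothetical sketch of the three TC steps (scoring and
    summarization rules are assumptions, not the paper's method).

    frames: array of shape (T, N, D) -- T frames, N patch tokens, D dims.
    """
    rng = np.random.default_rng(seed)
    T, N, D = frames.shape
    # 1) Extract core information from each frame: keep the k tokens with
    #    the largest L2 norm as a stand-in "informativeness" score.
    norms = np.linalg.norm(frames, axis=-1)                    # (T, N)
    idx = np.argsort(-norms, axis=-1)[:, :k]                   # (T, k)
    core = np.take_along_axis(frames, idx[..., None], axis=1)  # (T, k, D)
    # 2) Connect relevant tokens across frames and summarize them into
    #    context tokens, here via a few k-means steps over pooled tokens.
    pool = core.reshape(T * k, D)
    centers = pool[rng.choice(T * k, num_context, replace=False)]
    for _ in range(10):
        assign = np.argmin(((pool[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(num_context):
            members = pool[assign == c]
            if len(members):
                centers[c] = members.mean(0)
    # 3) Leverage context tokens during feature encoding: each frame's
    #    tokens attend to the shared context tokens (simplified attention).
    attn = frames @ centers.T / np.sqrt(D)                     # (T, N, num_context)
    attn = np.exp(attn - attn.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    return frames + attn @ centers  # context-infused tokens, same shape as input
```

The key point of the sketch is that the context tokens are shared across all frames, so every token's encoding is informed by the whole clip rather than by its own frame alone.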