Few-Shot Action Recognition (FSAR) is a challenging task that requires recognizing novel action categories from only a few labeled videos. Recent works typically use semantically coarse category names as auxiliary contexts to guide the learning of discriminative visual features. However, the context carried by action names alone is too limited to supply the background knowledge needed to capture novel spatial and temporal concepts in actions. In this paper, we propose DiST, an innovative Decomposition-incorporation framework for FSAR that leverages decoupled Spatial and Temporal knowledge from large language models to learn expressive multi-granularity prototypes. In the decomposition stage, we decouple vanilla action names into diverse spatio-temporal attribute descriptions (action-related knowledge); such commonsense knowledge complements semantic contexts from both spatial and temporal perspectives. In the incorporation stage, we propose Spatial/Temporal Knowledge Compensators (SKC/TKC) to discover discriminative object-level and frame-level prototypes, respectively. In SKC, object-level prototypes adaptively aggregate important patch tokens under the guidance of spatial knowledge; in TKC, frame-level prototypes exploit temporal attributes to assist inter-frame temporal relation modeling. The learned prototypes thus capture fine-grained spatial details and diverse temporal patterns in a transparent way. Experimental results show that DiST achieves state-of-the-art performance on five standard FSAR datasets.
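The knowledge-guided aggregation described for SKC can be pictured as a cross-attention pooling step, where text embeddings of LLM-derived spatial attributes act as queries over visual patch tokens. The following is a minimal NumPy sketch under assumed shapes; the function name, dimensions, and single-head dot-product attention are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def knowledge_guided_prototypes(patch_tokens, attr_embeds):
    """Sketch of SKC-style aggregation (hypothetical shapes):
    patch_tokens: (N, D) visual patch tokens
    attr_embeds:  (K, D) spatial-attribute text embeddings used as queries
    returns:      (K, D) object-level prototypes, each a convex
                  combination of patch tokens weighted by attention
    """
    d = patch_tokens.shape[-1]
    scores = attr_embeds @ patch_tokens.T / np.sqrt(d)   # (K, N)
    attn = softmax(scores, axis=-1)                      # rows sum to 1
    return attn @ patch_tokens                           # (K, D)
```

Because each attention row sums to one, every prototype stays inside the convex hull of the patch tokens, so attributes select and blend existing visual evidence rather than inventing features.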