Few-shot action recognition (FSAR) aims to recognize novel action categories from only a few labeled exemplars. Existing methods typically learn frame-level representations for each video by designing inter-frame temporal modeling strategies, or model inter-video interactions only at the coarse video-level granularity. However, they treat each episodic task in isolation and neglect fine-grained temporal relation modeling between videos, thus failing to capture the fine-grained temporal patterns shared across videos or to reuse temporal knowledge from historical tasks. In light of this, we propose HR2G-shot, a Hierarchical Relation-augmented Representation Generalization framework for FSAR, which unifies three types of relation modeling (inter-frame, inter-video, and inter-task) to learn task-specific temporal patterns from a holistic view. Beyond conducting inter-frame temporal interactions, we devise two components that respectively explore inter-video and inter-task relationships: i) Inter-video Semantic Correlation (ISC) performs cross-video frame-level interactions in a fine-grained manner, thereby capturing task-specific query features and enhancing both intra-class consistency and inter-class separability; ii) Inter-task Knowledge Transfer (IKT) retrieves and aggregates relevant temporal knowledge from a knowledge bank that stores diverse temporal patterns accumulated over historical episodic tasks. Extensive experiments on five benchmarks show that HR2G-shot outperforms leading FSAR methods.
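The abstract does not specify how IKT's retrieval and aggregation are implemented; a minimal sketch of one plausible instantiation follows, assuming cosine-similarity lookup over a fixed-size bank of stored pattern vectors and softmax-weighted aggregation of the top-k matches. The function name, the `k` and `temperature` parameters, and the similarity/weighting choices are all hypothetical illustrations, not the authors' method.

```python
import numpy as np

def retrieve_and_aggregate(query, bank, k=3, temperature=0.1):
    """Hypothetical IKT-style lookup: find the k stored temporal patterns
    most similar to the query feature (cosine similarity), then return
    their softmax-weighted average."""
    # L2-normalize the query and every bank entry for cosine similarity
    q = query / (np.linalg.norm(query) + 1e-8)
    b = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + 1e-8)
    sims = b @ q                      # cosine similarity to each stored pattern
    topk = np.argsort(sims)[-k:]      # indices of the k most similar patterns
    w = np.exp(sims[topk] / temperature)
    w /= w.sum()                      # softmax weights over the top-k matches
    return (w[:, None] * bank[topk]).sum(axis=0)
```

Under this sketch, the aggregated vector would then be fused with the current task's query features; how that fusion is performed (e.g., addition, gating, or attention) is left unspecified by the abstract.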