In this paper, we propose a novel Temporal Sequence-Aware Model (TSAM) for few-shot action recognition (FSAR). TSAM incorporates a sequential perceiver adapter into the pre-training framework so that the feature embeddings integrate both spatial information and sequential temporal dynamics. Unlike existing fine-tuning approaches, which capture temporal information by exploring the relationships among all frames, our perceiver-based adapter recurrently captures the sequential dynamics along the timeline and can therefore perceive changes in order. To obtain discriminative representations for each class, we build an extended textual corpus for each class using large language models (LLMs) and enrich the visual prototypes by integrating this contextual semantic information. In addition, we introduce an unbalanced optimal transport strategy for feature matching that mitigates the impact of class-unrelated features, thereby facilitating more effective decision-making. Experimental results on five FSAR datasets demonstrate that our method sets a new benchmark, outperforming the second-best competitors by large margins.
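To make the feature-matching idea concrete, the following is a minimal sketch of unbalanced optimal transport between query-frame features and class-prototype features. The abstract does not specify the formulation, so this assumes the common entropic variant with KL-relaxed marginals, solved by generalized Sinkhorn scaling. The function names, the cosine cost, and the values of `eps` and `rho` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def cosine_cost(x, y):
    """Pairwise cost 1 - cosine similarity between rows of x and rows of y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T


def unbalanced_sinkhorn(cost, a, b, eps=0.1, rho=0.5, n_iters=200):
    """Entropic unbalanced OT with KL marginal penalties (assumed variant).

    eps is the entropic regularization strength; rho weights the KL
    penalties on the marginals. A smaller rho lets the plan drop mass
    from points that are expensive to match.
    """
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    power = rho / (rho + eps)        # exponent < 1 softens the marginal constraints
    for _ in range(n_iters):
        u = (a / (K @ v)) ** power
        v = (b / (K.T @ u)) ** power
    return u[:, None] * K * v[None, :]   # transport plan


# Toy example (hypothetical data): three query features align with the
# three prototype features, the fourth is unrelated to all of them.
prototype = np.eye(4)[:3]            # shape (3, 4)
query = np.eye(4)                    # shape (4, 4); last row is the outlier
a = np.full(4, 1.0 / 4)              # uniform mass on query features
b = np.full(3, 1.0 / 3)              # uniform mass on prototype features

plan = unbalanced_sinkhorn(cosine_cost(query, prototype), a, b)
row_mass = plan.sum(axis=1)          # mass actually transported from each query feature
print(row_mass)
```

In this toy setup the unrelated fourth query feature transports less mass than the three matching ones, which is the sense in which relaxing the marginal constraints can reduce the influence of class-unrelated features on the matching score.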