Few-shot Action Recognition via Intra- and Inter-Video Information Maximization

Current few-shot action recognition involves two primary sources of information for classification: (1) intra-video information, determined by frame content within a single video clip, and (2) inter-video information, measured by relationships (e.g., feature similarity) among videos. However, existing methods inadequately exploit these two information sources. For intra-video information, current sampling operations for input videos may omit critical action information, reducing the utilization efficiency of video data. For inter-video information, action misalignment among videos makes it challenging to calculate precise relationships. Moreover, how to jointly consider inter- and intra-video information remains under-explored for few-shot action recognition. To this end, we propose a novel framework, Video Information Maximization (VIM), for few-shot video action recognition. VIM is equipped with an adaptive spatial-temporal video sampler and a spatiotemporal action alignment model to maximize intra- and inter-video information, respectively. The video sampler adaptively selects important frames and amplifies critical spatial regions for each input video based on the task at hand. This preserves and emphasizes informative parts of video clips while eliminating interference at the data level. The alignment model performs temporal and spatial action alignment sequentially at the feature level, leading to more precise measurements of inter-video similarity. Finally, these goals are facilitated by incorporating additional loss terms based on mutual information measurement. Consequently, VIM acts to maximize the distinctiveness of video information from limited video data. Extensive experimental results on public datasets for few-shot action recognition demonstrate the effectiveness and benefits of our framework.
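
To make the two information sources concrete, the following is a minimal sketch and not the VIM implementation. It assumes frame features are plain vectors and that per-frame importance scores are already given; the function names (`select_frames`, `video_similarity`, `classify`) and the toy data are hypothetical and introduced only for illustration. Frame selection stands in for intra-video information, and frame-wise cosine similarity stands in for inter-video information.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_frames(frames, scores, k):
    """Intra-video step (assumed): keep the k highest-scoring frames,
    preserving their original temporal order."""
    top = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)[:k]
    return [frames[i] for i in sorted(top)]


def video_similarity(query, support):
    """Inter-video step (assumed): mean cosine similarity between frames
    paired by temporal index. This pairing presumes the two videos are
    already aligned in time."""
    if len(query) != len(support):
        raise ValueError("videos must have the same number of frames")
    return sum(cosine(q, s) for q, s in zip(query, support)) / len(query)


def classify(query, support_set):
    """Return the label of the support video most similar to the query."""
    return max(support_set, key=lambda label: video_similarity(query, support_set[label]))


# Toy episode: 2-D frame features, one support video per class.
raw_query = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
importance = [0.9, 0.1, 0.8, 0.2]  # hypothetical per-frame scores
query = select_frames(raw_query, importance, k=2)
support_set = {
    "run": [[1.0, 0.0], [1.0, 1.0]],
    "jump": [[0.0, 1.0], [1.0, 1.0]],
}
predicted = classify(query, support_set)
```

Pairing frames by index is the step that degrades when the same action starts at different times in two videos, which is the misalignment problem the abstract describes. In this sketch the importance scores are fixed inputs, whereas the abstract describes a sampler that selects frames adaptively for the task at hand.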
