In recent years, few-shot action recognition has attracted increasing attention; it generally adopts the meta-learning paradigm. In this field, overcoming the overlapping distributions of classes and outliers with only limited samples remains a challenging problem. We believe that combining multi-modal and multi-view information can alleviate this issue through information complementarity. We therefore propose a method of Multi-view Distillation based on Multi-modal Fusion. First, a Probability Prompt Selector for the query is constructed to generate a probability prompt embedding from the comparison scores between the prompt embeddings of the support set and the visual embedding of the query. Second, we establish multiple views; in each view, we fuse the prompt embedding, as consistent information, with the visual embedding and the global or local temporal context to overcome the overlapping distributions of classes and outliers. Third, we perform distance fusion across the views and mutual distillation of matching ability from one view to another, making the model more robust to distribution bias. Our code is available at \url{https://github.com/cofly2014/MDMF}.
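To make the first step concrete, the sketch below illustrates one plausible reading of the Probability Prompt Selector: comparison scores between the support prompt embeddings and the query visual embedding are normalized into probabilities, which weight the prompts into a single probability prompt embedding. All function and variable names here are hypothetical, and the paper's exact scoring function may differ; this is a minimal sketch assuming cosine similarity and a softmax.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def probability_prompt_embedding(query_visual, support_prompts):
    """Hypothetical sketch of a probability prompt selector.

    query_visual:    (d,) visual embedding of the query video.
    support_prompts: (n_classes, d) prompt embeddings of the support classes.

    Returns a (d,) probability prompt embedding: the support prompts
    weighted by how well each one matches the query.
    """
    # Cosine similarity between the query and each class prompt
    # (the "comparison score" in the abstract).
    q = query_visual / np.linalg.norm(query_visual)
    p = support_prompts / np.linalg.norm(support_prompts, axis=1, keepdims=True)
    scores = p @ q                    # (n_classes,) comparison scores
    probs = softmax(scores)           # probability that each prompt matches
    return probs @ support_prompts    # probability-weighted prompt embedding
```

The weighted sum keeps the selector differentiable, so the comparison scores can be trained end-to-end rather than making a hard prompt choice.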
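The third step (distance fusion plus mutual distillation between views) can be sketched as follows, assuming each view yields query-to-class distances whose softmax gives a matching distribution; the views then distill into each other via a symmetric KL term. The names `dist_global` and `dist_local`, the averaging fusion, and the temperature `tau` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-8):
    # KL(p || q) over the last axis; eps guards against log(0).
    return np.sum(p * np.log((p + eps) / (q + eps)), axis=-1)

def fuse_and_distill(dist_global, dist_local, tau=1.0):
    """Hypothetical sketch of distance fusion with mutual distillation.

    dist_global, dist_local: (n_classes,) query-to-class distances from the
    global- and local-temporal-context views.

    Returns the fused distance and a symmetric distillation loss that pulls
    each view's matching distribution toward the other's.
    """
    fused = 0.5 * (dist_global + dist_local)   # simple distance fusion
    p_g = softmax(-dist_global / tau)          # smaller distance -> higher prob
    p_l = softmax(-dist_local / tau)
    distill_loss = 0.5 * (kl(p_g, p_l) + kl(p_l, p_g))
    return fused, distill_loss
```

The distillation term vanishes when the two views already agree, so it only transfers matching ability where the views disagree, which is where one view's complementary information can correct the other.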