Traditional temporal action localization (TAL) methods rely on large amounts of finely annotated data, whereas few-shot TAL reduces this dependence by using only a few training samples to recognize unseen action categories. However, existing few-shot TAL methods typically focus solely on video-level information and neglect textual information, which can provide valuable semantic support for the localization task. To address these issues, we propose a new few-shot TAL method based on Chain-of-Evidence multimodal reasoning to improve localization performance. Specifically, we design a novel few-shot learning framework to capture action commonalities and variations, comprising a semantic-aware text-visual alignment module that aligns the query and support videos at multiple levels. Meanwhile, to better express the temporal dependencies and causal relationships between actions at the textual level, we design a Chain-of-Evidence (CoE) reasoning method that progressively guides a Vision-Language Model (VLM) and a Large Language Model (LLM) to generate CoE text descriptions for videos. The generated texts capture action variations that visual features alone cannot. We conduct extensive experiments on the publicly available ActivityNet1.3 and THUMOS14 benchmarks and on our newly collected Human-related Anomaly Localization dataset. The experimental results demonstrate that our proposed method significantly outperforms existing methods in both single-instance and multi-instance scenarios. Our source code and data are available at https://github.com/MICLAB-BUPT/VAL-VLM.
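To make the CoE generation step concrete, below is a minimal sketch of how a VLM and an LLM might be prompted progressively to produce a chain-of-evidence description. The function `generate_coe_description` and the `vlm`/`llm` callables are hypothetical placeholders, not the released implementation; the sketch only illustrates the idea of collecting per-segment evidence and then asking an LLM to link it into a temporally ordered, causally connected description.

```python
# Minimal sketch of Chain-of-Evidence (CoE) text generation, assuming hypothetical
# `vlm` and `llm` callables (e.g., thin wrappers around off-the-shelf VLM/LLM APIs).
from typing import Callable, List


def generate_coe_description(
    frames: List[bytes],                   # sampled video frames (encoded images)
    vlm: Callable[[bytes, str], str],      # hypothetical: (frame, prompt) -> caption
    llm: Callable[[str], str],             # hypothetical: prompt -> text
    num_segments: int = 4,
) -> str:
    """Collect per-segment evidence with a VLM, then ask an LLM to link the
    evidence into a chain with explicit temporal and causal relations."""
    # Step 1: split the video into coarse segments and caption one frame per segment.
    step = max(1, len(frames) // num_segments)
    evidence = []
    for i, frame in enumerate(frames[::step][:num_segments]):
        caption = vlm(frame, "Describe the ongoing action and visible objects.")
        evidence.append(f"Evidence {i + 1}: {caption}")

    # Step 2: prompt the LLM to reason over the ordered evidence as a chain,
    # making temporal dependencies and causal links explicit.
    prompt = (
        "Given the following ordered pieces of evidence from a video:\n"
        + "\n".join(evidence)
        + "\nWrite a chain-of-evidence description: state what happens first, "
          "what it causes next, and how the action ends."
    )
    return llm(prompt)
```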