Temporal action localization (TAL), which involves recognizing and locating action instances, is a challenging task in video understanding. Most existing approaches directly predict action classes and regress offsets to boundaries, overlooking the fact that individual frames differ in importance. In this paper, we propose an Action Sensitivity Learning framework (ASL) to tackle this task. ASL assesses the value of each frame and then uses the resulting action sensitivity to recalibrate the training procedure. We first introduce a lightweight Action Sensitivity Evaluator that learns action sensitivity at the class level and the instance level, respectively. The outputs of the two branches are combined to reweight the gradients of the two sub-tasks (classification and boundary regression). Moreover, based on the action sensitivity of each frame, we design an Action Sensitive Contrastive Loss to enhance features, in which action-aware frames are sampled as positive pairs and action-irrelevant frames are pushed away. Extensive experiments on various action localization benchmarks (i.e., MultiThumos, Charades, Ego4D-Moment Queries v1.0, Epic-Kitchens 100, Thumos14 and ActivityNet1.3) show that ASL surpasses the state of the art in terms of average mAP across multiple types of scenarios, e.g., single-labeled, densely-labeled and egocentric.
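The reweighting idea described in the abstract can be illustrated with a small sketch. The abstract does not specify how the class-level and instance-level sensitivities are combined or normalized, so everything below is an assumption made for illustration: the function names `combine_sensitivity` and `reweighted_loss` are hypothetical, the two branches are combined by a simple sum, and the weights are normalized to a mean of 1 so that the overall loss scale is unchanged. Scaling a per-frame loss by a weight scales that frame's gradient by the same factor, which is the sense in which the sketch reweights gradients.

```python
from typing import List, Sequence


def combine_sensitivity(
    class_level: Sequence[float], instance_level: Sequence[float]
) -> List[float]:
    """Combine two per-frame sensitivity curves into one weight per frame.

    Assumption (not from the paper): the branches are summed and the result
    is normalized to have mean 1 across the frames.
    """
    if len(class_level) != len(instance_level):
        raise ValueError("both sensitivity curves must cover the same frames")
    if len(class_level) == 0:
        raise ValueError("at least one frame is required")
    combined = [c + i for c, i in zip(class_level, instance_level)]
    mean = sum(combined) / len(combined)
    if mean <= 0:
        raise ValueError("sensitivities must have a positive mean")
    return [w / mean for w in combined]


def reweighted_loss(
    per_frame_loss: Sequence[float], weights: Sequence[float]
) -> float:
    """Average the per-frame losses after scaling each by its frame weight."""
    if len(per_frame_loss) != len(weights):
        raise ValueError("one weight per frame is required")
    total = sum(w * l for w, l in zip(weights, per_frame_loss))
    return total / len(per_frame_loss)


if __name__ == "__main__":
    # Three frames; the middle frame is assumed to be the most action-sensitive.
    class_level = [0.1, 0.4, 0.1]
    instance_level = [0.1, 0.4, 0.1]
    weights = combine_sensitivity(class_level, instance_level)

    # Hypothetical per-frame losses for one sub-task (e.g. classification).
    cls_loss = [0.0, 1.0, 0.0]
    print("frame weights:", weights)
    print("reweighted loss:", reweighted_loss(cls_loss, weights))
```

In this toy setup the loss on the middle frame contributes more than it would under uniform averaging, while frames with low sensitivity contribute less. The same weights could be applied to the regression loss of the second sub-task.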