A Hybrid Attention Mechanism for Weakly-Supervised Temporal Action Localization

Weakly supervised temporal action localization is a challenging vision task due to the absence of ground-truth temporal locations of actions in the training videos. With only video-level supervision during training, most existing methods rely on a Multiple Instance Learning (MIL) framework to predict the start and end frame of each action category in a video. However, the existing MIL-based approach has a major limitation of only capturing the most discriminative frames of an action, ignoring the full extent of an activity. Moreover, these methods cannot model background activity effectively, which plays an important role in localizing foreground activities. In this paper, we present a novel framework named HAM-Net with a hybrid attention mechanism which includes temporal soft, semi-soft and hard attentions to address these issues. Our temporal soft attention module, guided by an auxiliary background class in the classification module, models the background activity by introducing an "action-ness" score for each video snippet. Moreover, our temporal semi-soft and hard attention modules, calculating two attention scores for each video snippet, help to focus on the less discriminative frames of an action to capture the full action boundary. Our proposed approach outperforms recent state-of-the-art methods by at least 2.2% mAP at IoU threshold 0.5 on the THUMOS14 dataset, and by at least 1.3% mAP at IoU threshold 0.75 on the ActivityNet1.2 dataset. Code can be found at: https://github.com/asrafulashiq/hamnet.
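The three attention scores described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the rule used here is an assumption: snippets whose soft attention reaches a threshold are treated as the most discriminative and are masked out of the semi-soft and hard scores. The names `hybrid_attention`, `topk_mean`, `video_score`, and the parameters `gamma` and `k` are hypothetical and are not taken from the HAM-Net code base.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a raw logit to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def hybrid_attention(logits, gamma=0.5):
    """Return (soft, semi_soft, hard) attention scores, one per snippet.

    logits: raw per-snippet attention logits (hypothetical input).
    gamma:  assumed threshold marking the most discriminative snippets.
    """
    # Soft attention: an "action-ness" score per snippet.
    soft = [sigmoid(z) for z in logits]
    # Semi-soft (assumed rule): drop snippets at or above gamma,
    # keep the soft score for the remaining, less discriminative ones.
    semi_soft = [a if a < gamma else 0.0 for a in soft]
    # Hard (assumed rule): same mask, but surviving snippets get weight 1.
    hard = [1.0 if a < gamma else 0.0 for a in soft]
    return soft, semi_soft, hard


def topk_mean(scores, k):
    """MIL-style pooling: mean of the k highest snippet scores (k >= 1)."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)


def video_score(class_scores, attention, k):
    """Video-level score for one class: attention-weighted top-k pooling."""
    weighted = [s * a for s, a in zip(class_scores, attention)]
    return topk_mean(weighted, k)


if __name__ == "__main__":
    soft, semi_soft, hard = hybrid_attention([2.0, -1.0, 0.0, 3.0])
    print(soft, semi_soft, hard)
    print(video_score([1.0, 2.0, 3.0, 4.0], soft, k=2))
```

Pooling with the semi-soft or hard scores instead of the soft ones forces the video-level prediction to rely on snippets other than the most discriminative ones, which is the intuition the abstract gives for recovering the full extent of an action.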

Related content

Attention mechanism

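The attention mechanism was first proposed in the field of computer vision, but it became truly popular with the Google DeepMind paper "Recurrent Models of Visual Attention" [14], which applied attention to an RNN model for image classification. Subsequently, Bahdanau et al., in "Neural Machine Translation by Jointly Learning to Align and Translate" [1], used an attention-like mechanism to perform translation and alignment simultaneously in machine translation; their work is regarded as the first to apply attention to NLP. Attention-based extensions of RNN models were then applied to a wide range of NLP tasks. More recently, how to use attention in CNNs has also become an active research topic. The figure below shows the general trend of attention research.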
