Temporal action detection (TAD), which locates and recognizes action segments, remains a challenging task in video understanding due to variable segment lengths and ambiguous boundaries. Existing methods treat the neighboring context of an action segment indiscriminately, leading to imprecise boundary predictions. We introduce a single-stage ContextDet framework, which makes use of large-kernel convolutions in TAD for the first time. Our model features a pyramid adaptive context aggregation (ACA) architecture that captures long-range context and improves action discriminability. Each ACA level consists of two novel modules. The context attention module (CAM) identifies salient contextual information, encourages context diversity, and preserves context integrity through a context gating block (CGB). The long context module (LCM) makes use of a mixture of large- and small-kernel convolutions to adaptively gather long-range context and fine-grained local features. Additionally, by varying the lengths of these large kernels across the ACA pyramid, our model provides lightweight yet effective context aggregation and action discrimination. We conducted extensive experiments and compared our model with a number of advanced TAD methods on six challenging TAD benchmarks: MultiThumos, Charades, FineAction, EPIC-Kitchens 100, Thumos14, and HACS, demonstrating superior accuracy with faster inference.
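The LCM's core idea, mixing a large temporal kernel (long-range context) with a small one (fine-grained local detail) and combining them adaptively, can be illustrated with a minimal numpy sketch. The kernel sizes, averaging weights, and sigmoid gate below are illustrative assumptions, not the paper's learned parameters.

```python
import numpy as np

def depthwise_conv1d(x, kernel):
    """Same-padded depthwise 1D convolution over time for one channel.
    x: (T,) feature sequence; kernel: (k,) weights, k odd."""
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, (pad, pad))
    return np.array([np.dot(xp[t:t + k], kernel) for t in range(len(x))])

def lcm_style_mixture(x, large_k=9, small_k=3):
    """Hypothetical LCM-style branch: a large kernel gathers long-range
    temporal context while a small kernel preserves local features; a toy
    sigmoid gate mixes the two per time step (illustrative, not learned)."""
    large = depthwise_conv1d(x, np.ones(large_k) / large_k)  # long-range branch
    small = depthwise_conv1d(x, np.ones(small_k) / small_k)  # local branch
    gate = 1.0 / (1.0 + np.exp(-(large - small)))            # adaptive mixing gate
    return gate * large + (1.0 - gate) * small
```

Varying `large_k` per pyramid level mirrors how the ACA pyramid adjusts kernel length with temporal resolution; in a real model the branches and gate would be learned channel-wise convolutions rather than fixed averages.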