Temporal action detection (TAD), which locates and recognizes action segments, remains a challenging task in video understanding due to variable segment lengths and ambiguous boundaries. Existing methods treat the neighboring context of an action segment indiscriminately, leading to imprecise boundary predictions. We introduce a single-stage ContextDet framework, which makes use of large-kernel convolutions in TAD for the first time. Our model features a pyramid adaptive context aggregation (ACA) architecture that captures long-range context and improves action discriminability. Each ACA level consists of two novel modules. The context attention module (CAM) identifies salient contextual information, encourages context diversity, and preserves context integrity through a context gating block (CGB). The long context module (LCM) makes use of a mixture of large- and small-kernel convolutions to adaptively gather long-range context and fine-grained local features. Additionally, by varying the lengths of these large kernels across the ACA pyramid, our model provides lightweight yet effective context aggregation and action discrimination. We conducted extensive experiments and compared our model with a number of advanced TAD methods on six challenging TAD benchmarks: MultiThumos, Charades, FineAction, EPIC-Kitchens 100, Thumos14, and HACS, demonstrating superior accuracy with faster inference.
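The LCM's core idea, mixing a large temporal kernel (long-range context) with a small one (fine-grained local detail) and combining them adaptively, can be illustrated with a minimal numpy sketch. The kernel sizes, averaging weights, and sigmoid gate below are illustrative assumptions, not the paper's learned parameters.

```python
import numpy as np

def depthwise_conv1d(x, kernel):
    """Same-padded depthwise 1D convolution over time for one channel.
    x: (T,) feature sequence; kernel: (k,) weights, k odd."""
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, (pad, pad))
    return np.array([np.dot(xp[t:t + k], kernel) for t in range(len(x))])

def lcm_style_mixture(x, large_k=9, small_k=3):
    """Hypothetical LCM-style branch: a large kernel gathers long-range
    temporal context while a small kernel preserves local features; a toy
    sigmoid gate mixes the two per time step (illustrative, not learned)."""
    large = depthwise_conv1d(x, np.ones(large_k) / large_k)  # long-range branch
    small = depthwise_conv1d(x, np.ones(small_k) / small_k)  # local branch
    gate = 1.0 / (1.0 + np.exp(-(large - small)))            # adaptive mixing gate
    return gate * large + (1.0 - gate) * small
```

Varying `large_k` per pyramid level mirrors how the ACA pyramid adjusts kernel length with temporal resolution; in a real model the branches and gate would be learned channel-wise convolutions rather than fixed averages.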