Recently proposed neural-network-based Temporal Action Detection (TAD) models are inherently limited in extracting discriminative representations and modeling action instances of various lengths from complex scenes, because they rely on shared-weight detection heads. Inspired by the success of dynamic neural networks, in this paper we build a novel Dynamic Feature Aggregation (DFA) module that can simultaneously adapt its kernel weights and receptive fields at different timestamps. Based on DFA, the proposed dynamic encoder layer aggregates the temporal features within the action time ranges and guarantees the discriminability of the extracted representations. Moreover, DFA enables a Dynamic TAD head (DyHead), which adaptively aggregates multi-scale features with adjusted parameters and learned receptive fields to better detect action instances with diverse ranges in videos. With the proposed encoder layer and DyHead, the new dynamic TAD model, DyFADet, achieves promising performance on a series of challenging TAD benchmarks, including HACS-Segment, THUMOS14, ActivityNet-1.3, Epic-Kitchen 100, Ego4D-Moment Queries v1.0, and FineAction. Code is released at https://github.com/yangle15/DyFADet-pytorch.
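The core idea of adapting receptive fields per timestamp can be illustrated with a minimal NumPy sketch. This is not the paper's DFA implementation (the function name, gate design, and pooling choices below are assumptions for illustration only): for each timestamp, a softmax gate mixes candidate features pooled over several temporal windows, so different timestamps effectively see different receptive fields.

```python
import numpy as np

def dynamic_aggregate(x, windows=(1, 3, 5), rng=None):
    """Toy sketch of per-timestamp dynamic temporal aggregation.

    x: (T, C) feature sequence. A per-timestamp softmax gate (produced
    here by a random projection standing in for a learned layer) mixes
    candidate features average-pooled with different window sizes,
    i.e. different temporal receptive fields.
    """
    T, C = x.shape
    rng = np.random.default_rng(0) if rng is None else rng
    K = len(windows)
    W = rng.standard_normal((C, K)) * 0.1  # stand-in for learned gate weights

    # Candidate features: temporal average pooling at several scales.
    cands = []
    for w in windows:
        pad = w // 2
        xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
        cands.append(np.stack([xp[t:t + w].mean(axis=0) for t in range(T)]))
    cands = np.stack(cands, axis=1)  # (T, K, C)

    # Per-timestamp softmax gates choose how to mix receptive fields.
    logits = x @ W  # (T, K)
    gates = np.exp(logits - logits.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)
    return (gates[:, :, None] * cands).sum(axis=1)  # (T, C)
```

In the actual model, the gate and kernel weights would be learned end-to-end, and DFA additionally adapts the convolution kernel weights themselves rather than only mixing pooled features.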