Feature-based approaches to Online Action Detection (OAD) have become increasingly common. However, these approaches are limited by their fixed backbone design, which forgoes the potential of a trainable backbone. In this paper, we propose the first end-to-end OAD model, termed E2E-LOAD, designed to address the major challenges of OAD, namely long-term understanding and efficient online reasoning. Specifically, our approach adopts an initial spatial model that is shared by all frames and maintains a long sequence cache so that inference incurs a low computational cost. We also advocate an asymmetric spatial-temporal model to handle long-form and short-form modeling effectively. Furthermore, we propose a novel and efficient inference mechanism that accelerates heavy spatial-temporal exploration. Extensive ablation studies and experiments demonstrate the effectiveness and efficiency of our proposed method. Notably, we achieve 17.3 (+12.6) FPS for end-to-end OAD, which is 3x faster than previous approaches, with 72.4% (+1.2%), 90.3% (+0.7%), and 48.1% (+26.0%) mAP on THUMOS14, TVSeries, and HDD, respectively. The source code will be made publicly available.
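As a rough illustration of why a long sequence cache keeps online inference cheap, the sketch below encodes each incoming frame exactly once and stores the result in a bounded buffer, so that a temporal model could read the cached history instead of re-encoding past frames. This is not the E2E-LOAD implementation: the names `StreamingFeatureCache`, `toy_encoder`, and `run_stream` are hypothetical, and the encoder is a trivial stand-in for a spatial model.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple

Frame = Sequence[float]
Feature = List[float]


class StreamingFeatureCache:
    """Encode each incoming frame once and keep a bounded feature history.

    Hypothetical helper for illustration only; not the authors' code.
    """

    def __init__(self, encode_frame: Callable[[Frame], Feature], max_len: int) -> None:
        if max_len <= 0:
            raise ValueError("max_len must be positive")
        self.encode_frame = encode_frame
        # A deque with maxlen drops the oldest feature automatically.
        self.cache: Deque[Feature] = deque(maxlen=max_len)
        self.encode_calls = 0

    def step(self, frame: Frame) -> List[Feature]:
        """Encode only the new frame and return the cached history."""
        self.cache.append(self.encode_frame(frame))
        self.encode_calls += 1
        return list(self.cache)


def toy_encoder(frame: Frame) -> Feature:
    # Assumed stand-in for a per-frame spatial model: mean and max of the pixels.
    return [sum(frame) / len(frame), max(frame)]


def run_stream(frames: Sequence[Frame], max_len: int) -> Tuple[List[int], int]:
    """Feed frames one by one; report history length per step and total encoder calls."""
    cache = StreamingFeatureCache(toy_encoder, max_len)
    history_lengths = [len(cache.step(frame)) for frame in frames]
    return history_lengths, cache.encode_calls


if __name__ == "__main__":
    frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(run_stream(frames, max_len=3))
```

With this pattern the encoder runs once per frame, whereas re-encoding the full history window at every step would cost roughly the window length times as many encoder calls.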