Temporal Action Detection (TAD) is a crucial but challenging task in video understanding. It aims to detect both the category and the start and end frames of every action instance in a long, untrimmed video. Most current models adopt both RGB and optical-flow streams for the TAD task, so the original RGB frames must first be converted into optical-flow frames, which incurs additional computation and time cost and is an obstacle to real-time processing. Moreover, many models adopt two-stage strategies, which slow down inference and require complicated tuning of the proposal-generation stage. In contrast, we propose a one-stage, anchor-free temporal localization method that uses the RGB stream only, built on a novel Newtonian Mechanics-MLP architecture. It achieves accuracy comparable to existing state-of-the-art models while surpassing their inference speed by a large margin, reaching a typical inference speed of 4.44 videos per second on THUMOS14. In deployed applications, where no optical-flow conversion is needed at all, inference is even faster. These results also show that MLPs have great potential in downstream tasks such as TAD. The source code is available at https://github.com/BonedDeng/TadML
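To make the one-stage, anchor-free paradigm concrete, the sketch below shows a minimal MLP detection head in PyTorch: for each temporal position of an RGB feature sequence, it predicts class logits and non-negative distances to the action's start and end boundaries. This is an illustrative assumption of how such a head can be structured, not the paper's actual Newtonian Mechanics-MLP; all names and layer sizes (`AnchorFreeTADHead`, `in_dim`, `hidden_dim`) are hypothetical.

```python
# Minimal sketch of a one-stage, anchor-free TAD head (illustrative only;
# not the paper's Newtonian Mechanics-MLP). Each time step regresses the
# distances to the action's start/end and predicts per-class logits.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnchorFreeTADHead(nn.Module):
    def __init__(self, in_dim: int = 512, hidden_dim: int = 256, num_classes: int = 20):
        super().__init__()
        # Shared per-frame MLP over the temporal feature sequence.
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
        )
        self.cls_head = nn.Linear(hidden_dim, num_classes)  # action category logits
        self.reg_head = nn.Linear(hidden_dim, 2)            # (start, end) offsets

    def forward(self, feats: torch.Tensor):
        # feats: (batch, time, in_dim) clip-level RGB features
        h = self.mlp(feats)
        cls_logits = self.cls_head(h)                # (batch, time, num_classes)
        # Softplus keeps the predicted boundary distances non-negative.
        offsets = F.softplus(self.reg_head(h))       # (batch, time, 2)
        return cls_logits, offsets

# Usage: time step t proposes the interval [t - offsets[..., 0], t + offsets[..., 1]].
feats = torch.randn(1, 128, 512)
cls_logits, offsets = AnchorFreeTADHead()(feats)
```

Because every temporal position directly proposes an interval, no anchor windows or separate proposal stage are needed, which is what allows a single forward pass per video.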