Designing a real-time framework for spatio-temporal action detection remains challenging. In this paper, we propose YOWOv2, a novel real-time action detection framework that takes advantage of both a 3D backbone and a 2D backbone for accurate action detection. A multi-level detection pipeline is designed to detect action instances of different scales. To this end, we build a simple and efficient 2D backbone with a feature pyramid network to extract classification features and regression features at different levels. For the 3D backbone, we adopt an existing efficient 3D CNN to save development time. By combining 3D and 2D backbones of different sizes, we design a YOWOv2 family comprising YOWOv2-Tiny, YOWOv2-Medium, and YOWOv2-Large. We also introduce the popular dynamic label assignment strategy and an anchor-free mechanism to bring YOWOv2 in line with advanced model architecture design. With these improvements, YOWOv2 significantly outperforms YOWO while still running in real time. Without any bells and whistles, YOWOv2 achieves 87.0% frame mAP and 52.8% video mAP at over 20 FPS on UCF101-24. On AVA, YOWOv2 achieves 21.7% frame mAP at over 20 FPS. Our code is available at https://github.com/yjh0410/YOWOv2.
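As a rough illustration of what a multi-level, anchor-free detection pipeline implies, the sketch below enumerates one prediction point per grid cell at several pyramid levels and decodes a box directly from each point. It is only a sketch under stated assumptions: the 224x224 input size, the strides 8, 16, and 32, the distance-based (left, top, right, bottom) box parameterization, and the helper names `level_points` and `decode_box` are all hypothetical and are not taken from the abstract or from the YOWOv2 code.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in input pixels


def level_points(img_size: int, stride: int) -> List[Tuple[float, float]]:
    """Centers of all grid cells for one pyramid level, in input pixels."""
    n = img_size // stride
    return [((x + 0.5) * stride, (y + 0.5) * stride)
            for y in range(n) for x in range(n)]


def decode_box(point, distances, stride) -> Box:
    """Decode (left, top, right, bottom) distances, given in units of the
    level's stride, into a box around the point. No anchor boxes are used.
    This parameterization is an assumption made for illustration."""
    cx, cy = point
    l, t, r, b = (d * stride for d in distances)
    return (cx - l, cy - t, cx + r, cy + b)


# Hypothetical settings: a 224x224 key frame and three pyramid levels.
IMG_SIZE = 224
STRIDES = (8, 16, 32)

points_per_level = {s: level_points(IMG_SIZE, s) for s in STRIDES}
total_points = sum(len(p) for p in points_per_level.values())
```

Under these assumptions, fine levels (small stride) contribute many points suited to small action instances, while coarse levels contribute few points suited to large ones; each point predicts its own box and class scores, so no hand-designed anchor shapes are required.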