In this paper, we propose a new framework called YOWOv3, an improved version of YOWOv2, designed specifically for the task of Human Action Detection and Recognition. The framework facilitates extensive experimentation with different configurations and supports easy customization of the model's components, reducing the effort required to understand and modify the code. YOWOv3 demonstrates superior performance compared to YOWOv2 on two widely used datasets for Human Action Detection and Recognition: UCF101-24 and AVAv2.2. Specifically, the predecessor model YOWOv2 achieves an mAP of 85.2% and 20.3% on UCF101-24 and AVAv2.2, respectively, with 109.7M parameters and 53.6 GFLOPs. In contrast, our model, YOWOv3, with only 59.8M parameters and 39.8 GFLOPs, achieves an mAP of 88.33% and 20.31% on UCF101-24 and AVAv2.2, respectively. These results show that YOWOv3 substantially reduces the number of parameters and GFLOPs while achieving comparable or better accuracy.
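To make the efficiency claim concrete, the relative savings implied by the figures quoted above can be checked with a quick calculation (the dictionary names below are illustrative, not part of the YOWOv3 codebase):

```python
# Quick sanity check of the parameter/compute savings quoted in the abstract.
yowov2 = {"params_M": 109.7, "gflops": 53.6}  # predecessor model
yowov3 = {"params_M": 59.8, "gflops": 39.8}   # proposed model

# Relative reduction: 1 - (new / old)
param_reduction = 1 - yowov3["params_M"] / yowov2["params_M"]
gflop_reduction = 1 - yowov3["gflops"] / yowov2["gflops"]

print(f"parameters: {param_reduction:.1%} fewer")  # ~45.5% fewer parameters
print(f"compute:    {gflop_reduction:.1%} fewer")  # ~25.7% fewer GFLOPs
```

That is, YOWOv3 uses roughly 45% fewer parameters and 26% less compute than YOWOv2 while improving mAP on UCF101-24 and matching it on AVAv2.2.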