This paper presents the baseline method for the Sports Video task of the MediaEval 2022 benchmark. The task comprises two subtasks: stroke classification from trimmed videos and stroke detection from untrimmed videos. The baseline addresses both. We propose two types of 3D-CNN architectures to solve them. Both 3D-CNNs use spatio-temporal convolutions and attention mechanisms, and the architectures and the training process are tailored to the subtask being addressed. The baseline is shared publicly online to help participants in their investigations and to ease some aspects of the task, such as video processing, training, evaluation, and the submission routine. On the classification subtask, the baseline reaches 86.4% accuracy with our v2 model. On the detection subtask, it reaches a mAP of 0.131 and an IoU of 0.515 with our v1 model.
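The abstract reports an IoU for the detection subtask without stating how it is computed. As a rough illustration only, the sketch below shows one common definition of temporal IoU between a predicted and a ground-truth stroke segment. The function name `temporal_iou` and the (start, end) frame-interval representation with an exclusive end are assumptions made for this example, not the task's official evaluation code.

```python
def temporal_iou(pred, truth):
    """Temporal IoU of two frame intervals given as (start, end), end exclusive.

    Hypothetical helper for illustration; the official task metric may differ.
    """
    pred_start, pred_end = pred
    truth_start, truth_end = truth
    # Overlap length is zero when the two intervals do not intersect.
    intersection = max(0, min(pred_end, truth_end) - max(pred_start, truth_start))
    # Union is the sum of both lengths minus the part counted twice.
    union = (pred_end - pred_start) + (truth_end - truth_start) - intersection
    return intersection / union if union > 0 else 0.0
```

For instance, a predicted stroke over frames 0 to 10 and a ground-truth stroke over frames 5 to 15 overlap on 5 frames out of a union of 15, giving an IoU of one third.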