Player identification is a crucial component of vision-driven soccer analytics, enabling downstream tasks such as player assessment, in-game analysis, and broadcast production. However, automatically detecting jersey numbers from player tracklets in videos is challenging because of motion blur, low resolution, distortions, and occlusions. Existing methods based on Spatial Transformer Networks, CNNs, and Vision Transformers have shown success on image data but struggle on real-world video data, where the jersey number is not visible in most frames. Hence, identifying the frames that contain the jersey number is a key sub-problem. To address these issues, we propose a robust keyframe identification module that extracts frames containing essential high-level information about the jersey number. A spatio-temporal network is then employed to model spatial and temporal context and predict the probabilities of jersey numbers in the video. Additionally, we adopt a multi-task loss function to predict the probability distribution of each digit separately. Extensive evaluations on the SoccerNet dataset demonstrate that incorporating our proposed keyframe identification module increases accuracy by 37.81% and 37.70% on two different test sets with domain gaps. These results highlight the effectiveness and importance of our approach in tackling the challenges of automatic jersey number detection in sports videos.
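To make the per-digit multi-task loss concrete, the following is a minimal sketch rather than the implementation evaluated here. The abstract does not specify the label encoding, the number of classes per head, or the loss weights, so everything below is an assumption made for illustration: jersey numbers are taken to lie in 0 to 99, the tens head has 11 classes (digits 0 to 9 plus a hypothetical "no digit" class), the units head has 10 classes, and the two cross-entropy terms are combined with the hypothetical weights `w_tens` and `w_units`.

```python
import math

# Assumptions for illustration only (not stated in the abstract):
# jersey numbers lie in 0..99, the tens head has 11 classes and the
# units head has 10 classes.
BLANK = 10  # hypothetical "no digit" class for the tens place of 0..9


def split_digits(number):
    """Map a jersey number in 0..99 to (tens_label, units_label)."""
    if not 0 <= number <= 99:
        raise ValueError("jersey number must be in 0..99")
    tens = number // 10 if number >= 10 else BLANK
    return tens, number % 10


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(logits, label):
    """Negative log-probability of the true class."""
    return -log_softmax(logits)[label]


def multitask_digit_loss(tens_logits, units_logits, number,
                         w_tens=1.0, w_units=1.0):
    """Weighted sum of one cross-entropy term per digit position."""
    tens, units = split_digits(number)
    return (w_tens * cross_entropy(tens_logits, tens)
            + w_units * cross_entropy(units_logits, units))


def decode(tens_logits, units_logits):
    """Recombine the most likely class of each head into a number."""
    tens = max(range(len(tens_logits)), key=tens_logits.__getitem__)
    units = max(range(len(units_logits)), key=units_logits.__getitem__)
    return units if tens == BLANK else tens * 10 + units
```

Under these assumptions, each head outputs a distribution over 10 or 11 classes instead of a single distribution over 100 jersey numbers, so a digit that is rare in one position still receives training signal from every number that shares it.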