Sports analytics benefits from recent advances in machine learning, which provide a competitive advantage for teams and individuals. One important task in this context is measuring the performance of individual players to produce reports and log files for subsequent analysis. In sports such as basketball, this requires re-identifying players during a match, either across multiple camera viewpoints or from a single camera viewpoint at different times. In this work, we investigate whether the outstanding zero-shot performance of pre-trained CLIP models can be transferred to the domain of player re-identification. For this purpose, we reformulate the contrastive language-to-image pre-training approach of CLIP as a contrastive image-to-image training approach, using the InfoNCE loss as the training objective. Unlike previous work, our approach is entirely class-agnostic and benefits from large-scale pre-training. With a fine-tuned CLIP ViT-L/14 model, we achieve 98.44 % mAP on the MMSports 2022 Player Re-Identification challenge. Furthermore, we show that the CLIP Vision Transformers already have strong OCR capabilities and can identify useful player features such as shirt numbers in a zero-shot manner, without any fine-tuning on the dataset. By applying the Score-CAM algorithm, we visualise the image regions that our fine-tuned model treats as most important when calculating the similarity score between two images of a player.
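To make the image-to-image objective concrete, the following is a minimal NumPy sketch of an InfoNCE loss over a batch of paired embeddings, where row i of each matrix is assumed to come from two different images of the same player. The abstract only states that InfoNCE is the training objective; the symmetric two-direction form, the temperature value of 0.07, and the function names `l2_normalize`, `log_softmax` and `info_nce` are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image embeddings.

    emb_a[i] and emb_b[i] are assumed to be embeddings of two images of the
    same player; every other row in the batch acts as a negative.
    The temperature value is an assumed default, not taken from the paper.
    """
    a = l2_normalize(emb_a)
    b = l2_normalize(emb_b)
    logits = a @ b.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(a))                 # positives sit on the diagonal
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gallery = rng.normal(size=(8, 16))
    query = gallery + 0.01 * rng.normal(size=gallery.shape)
    print("aligned pairs: ", info_nce(query, gallery))
    print("shuffled pairs:", info_nce(query, np.roll(gallery, 1, axis=0)))
```

Because the loss only needs to know which two images belong together, no player class labels enter the computation, which is consistent with the class-agnostic framing above. In practice the embeddings would come from the CLIP image encoder rather than from random vectors.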