Reasoning over sports videos for question answering is an important task with numerous applications, such as player training and information retrieval. However, this task has not yet been explored, owing to the lack of relevant datasets and its inherent difficulty. Most existing datasets for video question answering (VideoQA) focus on general, coarse-grained understanding of daily-life videos, which does not transfer to sports scenarios that require professional action understanding and fine-grained motion analysis. In this paper, we introduce Sports-QA, the first dataset specifically designed for the sports VideoQA task. Sports-QA covers multiple sports and includes several question types: descriptive, chronological, causal, and counterfactual. Furthermore, to address the characteristics of the sports VideoQA task, we propose a new Auto-Focus Transformer (AFT) that automatically focuses on particular scales of temporal information when answering a question. We conduct extensive experiments on Sports-QA, including baseline studies and an evaluation of different methods. The results demonstrate that AFT achieves state-of-the-art performance.