Reasoning over sports videos for question answering is an important task with numerous applications, such as player training and information retrieval. However, this task has not been explored, owing to the lack of relevant datasets and its inherent difficulty. Most video question answering (VideoQA) datasets focus on general, coarse-grained understanding of daily-life videos and are therefore ill-suited to sports scenarios, which require professional action understanding and fine-grained motion analysis. In this paper, we introduce Sports-QA, the first dataset specifically designed for the sports VideoQA task. Sports-QA includes various types of questions, such as descriptions, chronologies, causalities, and counterfactual conditions, and covers multiple sports. Furthermore, to address the characteristics of the sports VideoQA task, we propose a new Auto-Focus Transformer (AFT) capable of automatically focusing on particular scales of temporal information for question answering. We conduct extensive experiments on Sports-QA, including baseline studies and the evaluation of different methods. The results demonstrate that our AFT achieves state-of-the-art performance.