Reasoning over sports videos for question answering is an important task with numerous applications, such as player training and information retrieval. However, the task remains unexplored, owing to the lack of relevant datasets and its inherent difficulty. Most datasets for video question answering (VideoQA) focus on general, coarse-grained understanding of daily-life videos, which does not transfer to sports scenarios that require professional action understanding and fine-grained motion analysis. In this paper, we introduce Sports-QA, the first dataset specifically designed for the sports VideoQA task. Sports-QA covers multiple sports and includes various question types, such as descriptions, chronologies, causalities, and counterfactual conditions. Furthermore, to address the characteristics of the sports VideoQA task, we propose a new Auto-Focus Transformer (AFT) capable of automatically focusing on particular scales of temporal information for question answering. We conduct extensive experiments on Sports-QA, including baseline studies and the evaluation of different methods. The results demonstrate that our AFT achieves state-of-the-art performance.