Multimodal Large Language Models (MLLMs) are advancing the ability to reason about complex sports scenarios by integrating textual and visual information. To comprehensively evaluate their capabilities, we introduce SPORTU, a benchmark designed to assess MLLMs across multi-level sports reasoning tasks. SPORTU comprises two key components: SPORTU-text, featuring 900 multiple-choice questions with human-annotated explanations for rule comprehension and strategy understanding, which tests models' ability to reason about sports through question answering (QA) alone, without visual inputs; and SPORTU-video, consisting of 1,701 slow-motion video clips across 7 different sports and 12,048 QA pairs, designed to assess multi-level reasoning, from simple sports recognition to complex tasks such as foul detection and rule application. On SPORTU-text, we evaluate four prevalent LLMs using few-shot learning supplemented by chain-of-thought (CoT) prompting. GPT-4o achieves the highest accuracy at 71%, but still falls short of human-level performance, highlighting room for improvement in rule comprehension and reasoning. On SPORTU-video, we evaluate 7 proprietary and 6 open-source MLLMs. Experiments show that models fall short on hard tasks that require deep reasoning and rule-based understanding. Claude-3.5-Sonnet performs best with only 52.6% accuracy on the hard task, showing large room for improvement. We hope that SPORTU will serve as a critical step toward evaluating models' capabilities in sports understanding and reasoning.