Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports

Yuchen Yang, Yuqing Shao, Duxiu Huang, Linfeng Dong, Yifei Liu, Suixin Tang, Xiang Zhou, Yuanyuan Gao, Wei Wang, Yue Zhou, Xue Yang, Yanfeng Wang, Xiao Sun, Zhihang Zhong

Sports have long attracted broad attention as they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interactions. To this end, we present CourtSI, the first large-scale spatial intelligence dataset tailored to sports scenarios. CourtSI contains over 1M QA pairs, organized under a holistic taxonomy that systematically covers spatial counting, distance measurement, localization, and relational reasoning, across representative net sports including badminton, tennis, and table tennis. Leveraging well-defined court geometry as metric anchors, we develop a semi-automatic data engine to reconstruct sports scenes, enabling scalable curation of CourtSI. In addition, we introduce CourtSI-Bench, a high-quality evaluation benchmark comprising 3,686 QA pairs with rigorous human verification. We evaluate 25 proprietary and open-source VLMs on CourtSI-Bench, revealing a remaining human-AI performance gap and limited generalization from existing spatial intelligence benchmarks. These findings indicate that sports scenarios expose limitations in spatial intelligence capabilities captured by existing benchmarks. Further, fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points. The adapted model also generalizes effectively to CourtSI-Ext, an evaluation set built on a similar but unseen sport, and demonstrates enhanced spatial-aware commentary generation. Together, these findings demonstrate that CourtSI provides a scalable pathway toward advancing spatial intelligence of VLMs in sports.
