EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports

While video large language models (Video-LLMs) excel at understanding slow-paced, real-world egocentric videos, their capabilities in high-velocity, information-dense virtual environments remain underexplored. Existing benchmarks focus on daily activities and lack a rigorous testbed for evaluating fast, rule-bound reasoning in virtual scenarios. To fill this gap, we introduce EgoEsportsQA, a pioneering video question-answering (QA) benchmark that grounds perception and reasoning in expert esports knowledge. Using a scalable six-stage pipeline, we curate 1,745 high-quality QA pairs from professional matches across three first-person shooter games. The questions are organized into a decoupled two-dimensional taxonomy: 11 sub-tasks along the cognitive capability dimension (covering the perception and reasoning levels) and 6 sub-tasks along the esports knowledge dimension. Comprehensive evaluations of state-of-the-art Video-LLMs show that current models still fall short of satisfactory performance, with the best model reaching only 71.58%. The results expose notable gaps along both axes: models are stronger at basic visual perception than at deep tactical reasoning, and they grasp overall macro-progression better than fine-grained micro-operations. Extensive ablation experiments further expose intrinsic weaknesses of current Video-LLM architectures. Additional analysis suggests that our dataset not only reveals connections between real-world and virtual egocentric domains, but also offers guidance for optimizing downstream esports applications, thereby fostering the advancement of Video-LLMs across diverse egocentric environments.
