VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

Recent video multimodal large language models achieve impressive results across various benchmarks. However, current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is often measured without verifying whether models identify the precise spatio-temporal evidence supporting their predictions. To address these limitations, we present VideoZeroBench, a hierarchical benchmark for challenging long-video question answering that rigorously verifies spatio-temporal evidence. It comprises 500 manually annotated questions across 13 domains, each paired with temporal intervals and spatial bounding boxes as evidence. To disentangle answer generation, temporal grounding, and spatial grounding, we introduce a five-level evaluation protocol that progressively tightens evidence requirements. Experiments show that even Gemini-3-Pro correctly answers fewer than 17% of questions under the standard end-to-end QA setting (Level-3). When grounding constraints are imposed, performance drops sharply: no model exceeds 1% accuracy when both correct answering and accurate spatio-temporal localization are required (Level-5), and most models fail to produce any correct grounded prediction. These results expose a significant gap between surface-level answer correctness and genuine evidence-based reasoning, revealing that grounded video understanding remains a bottleneck for long-video QA. We further analyze performance across minimal evidence spans, atomic abilities, and inference paradigms, providing insights for future research in grounded video reasoning. The benchmark and code will be made publicly available.
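To make the strictest setting concrete, the following is a minimal sketch of how a Level-5-style check could combine answer correctness with temporal and spatial grounding. It is illustrative only and is not the benchmark's released code: the abstract does not state the matching metric or thresholds, so the use of intersection-over-union (IoU), the 0.5 thresholds, exact-match answer comparison, and all names (`Prediction`, `temporal_iou`, `box_iou`, `grounded_correct`) are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple

Interval = Tuple[float, float]           # (start_sec, end_sec)
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def temporal_iou(a: Interval, b: Interval) -> float:
    """IoU of two time intervals; 0.0 if they do not overlap."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a: Box, b: Box) -> float:
    """IoU of two axis-aligned boxes; 0.0 if they do not overlap."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Prediction:
    answer: str
    interval: Interval
    box: Box


def grounded_correct(pred: Prediction, gold: Prediction,
                     t_thresh: float = 0.5, s_thresh: float = 0.5) -> bool:
    """Assumed Level-5-style criterion: the answer matches AND both the
    temporal interval and the bounding box overlap the annotation enough.
    Thresholds and exact-match comparison are hypothetical choices."""
    answer_ok = pred.answer.strip().lower() == gold.answer.strip().lower()
    time_ok = temporal_iou(pred.interval, gold.interval) >= t_thresh
    space_ok = box_iou(pred.box, gold.box) >= s_thresh
    return answer_ok and time_ok and space_ok
```

Under this sketch, a prediction with the right answer but a poorly localized interval or box counts as incorrect, which is the kind of gap between answer-only accuracy and grounded accuracy that the reported Level-3 and Level-5 numbers reflect.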
