Most video reasoning models generate only textual reasoning traces, without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging because it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3-Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning by highlighting key timestamps, objects, and bounding boxes, making the reasoning process traceable and verifiable. To enable this capability, we first construct the high-quality STGR datasets, which provide unified spatio-temporal supervision that existing resources lack. We then adopt a cold-start reinforcement learning strategy with specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On the V-STAR benchmark, Open-o3-Video achieves state-of-the-art performance, improving mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline, and it shows consistent gains across a range of video understanding benchmarks. Beyond accuracy, the grounded reasoning traces produced by Open-o3-Video support confidence-aware test-time scaling, improving answer reliability.
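To make the idea of a reward that jointly encourages answer accuracy, temporal alignment, and spatial precision more concrete, the following is a minimal illustrative sketch, not the reward used by Open-o3-Video. The abstract does not specify the reward's form, so everything here is an assumption: the Gaussian temporal score, the box IoU spatial score, the exact-match answer check, the weights, and all function and parameter names are hypothetical.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed pixel coordinates


def temporal_score(pred_t: float, gt_t: float, sigma: float = 2.0) -> float:
    """Hypothetical temporal term: 1.0 when timestamps match, decaying with distance (seconds)."""
    return math.exp(-((pred_t - gt_t) ** 2) / (2.0 * sigma ** 2))


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes; 0.0 if the union is empty."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0.0 else 0.0


def combined_reward(
    pred_answer: str,
    gt_answer: str,
    pred_t: float,
    gt_t: float,
    pred_box: Box,
    gt_box: Box,
    w_acc: float = 1.0,    # assumed weight, not from the source
    w_time: float = 0.5,   # assumed weight, not from the source
    w_space: float = 0.5,  # assumed weight, not from the source
) -> float:
    """Weighted sum of answer, temporal, and spatial terms (an assumed combination rule)."""
    acc = 1.0 if pred_answer.strip().lower() == gt_answer.strip().lower() else 0.0
    return (
        w_acc * acc
        + w_time * temporal_score(pred_t, gt_t)
        + w_space * box_iou(pred_box, gt_box)
    )
```

In this sketch a rollout that answers correctly but points at the wrong moment or region earns less than one whose cited timestamp and bounding box also match the annotation, which is the general behavior the abstract describes.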