Camera pose matters. The position and orientation of each viewpoint define a shared spatial coordinate frame that relates observations across video frames. Yet this signal is largely absent from multimodal LLMs (MLLMs) for video understanding, which process frames as isolated 2D snapshots rather than as views of the persistent scene that humans perceive. We revisit pose as a lightweight supervisory signal and introduce Cambrian-P, a video MLLM augmented with per-frame learnable camera tokens and a pose regression head. With a carefully designed sampling scheme, the model achieves substantial gains of 4.5-6.5% on spatial reasoning benchmarks such as VSI-Bench, generalizes across eight additional spatial and general video QA benchmarks, and, as a byproduct, achieves state-of-the-art streaming pose estimation on ScanNet. Surprisingly, training on pseudo-annotated poses from in-the-wild video further improves performance on general video QA benchmarks, showing that pose helps beyond spatial reasoning. Together, these results position camera pose as a fundamental signal for video models that reason about the physical world.