Understanding camera dynamics is a fundamental pillar of video spatial intelligence. However, existing multimodal models predominantly treat this task as black-box classification, often confusing physically distinct motions because they rely on superficial visual patterns rather than geometric cues. We present CamReasoner, a framework that reformulates camera-movement understanding as a structured inference process, bridging the gap between perception and cinematic logic. Our approach centers on the Observation-Thinking-Answer (O-T-A) paradigm, which compels the model to decode spatio-temporal cues, such as trajectories and view frustums, within an explicit reasoning block. To instill this capability, we construct a large-scale inference trajectory suite comprising 18k SFT reasoning chains and 38k RL feedback samples. Notably, we are the first to employ reinforcement learning for logical alignment in this domain, ensuring that motion inferences are grounded in physical geometry rather than contextual guesswork. By applying reinforcement learning on top of the O-T-A paradigm, CamReasoner effectively suppresses hallucinations and achieves state-of-the-art performance across multiple benchmarks.