Understanding camera dynamics is a fundamental pillar of video spatial intelligence. However, existing multimodal models predominantly treat this task as black-box classification, often confusing physically distinct motions because they rely on superficial visual patterns rather than geometric cues. We present \textbf{CamReasoner}, a framework that reformulates camera movement understanding as a structured inference process to bridge the gap between perception and cinematic logic. Our approach centers on the Observation-Thinking-Answer (O-T-A) paradigm, which compels the model to articulate spatio-temporal observations and reason about motion patterns within an explicit reasoning block before committing to an answer. To instill this capability, we construct a Large-scale Inference Trajectory Suite comprising 18k SFT reasoning chains and 38k RL feedback samples. To the best of our knowledge, \textbf{we are the first to employ RL for logical alignment in camera movement understanding}, ensuring motion inferences are grounded in structured visual reasoning rather than contextual guesswork. Built upon Qwen2.5-VL-7B, CamReasoner-7B improves binary classification accuracy from 73.8\% to 78.4\% (+4.6 points) and VQA accuracy from 60.9\% to 74.5\% (+13.6 points) over its backbone, consistently outperforming both proprietary and open-source baselines across multiple benchmarks.
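As a rough illustration of what an O-T-A-structured response might look like to downstream code, the sketch below parses a model output into its three sections and checks that they appear in order and are non-empty. The tag names (`<observation>`, `<think>`, `<answer>`), the `OTAResponse` type, and both helper functions are assumptions made for this example; the abstract does not specify the actual markup or any parsing code used by CamReasoner.

```python
import re
from typing import NamedTuple, Optional


class OTAResponse(NamedTuple):
    """The three sections of a hypothetical O-T-A response."""
    observation: str
    thinking: str
    answer: str


# Hypothetical tag names: the source only names the three stages,
# not the markup that delimits them.
_OTA_PATTERN = re.compile(
    r"<observation>(?P<observation>.*?)</observation>\s*"
    r"<think>(?P<thinking>.*?)</think>\s*"
    r"<answer>(?P<answer>.*?)</answer>",
    re.DOTALL,
)


def parse_ota(text: str) -> Optional[OTAResponse]:
    """Return the sections if they appear in O-T-A order, else None."""
    match = _OTA_PATTERN.search(text)
    if match is None:
        return None
    return OTAResponse(
        observation=match.group("observation").strip(),
        thinking=match.group("thinking").strip(),
        answer=match.group("answer").strip(),
    )


def is_well_formed(text: str) -> bool:
    """True when all three sections are present, ordered, and non-empty."""
    parsed = parse_ota(text)
    return parsed is not None and all(parsed)
```

A check of this kind could, for instance, be used to filter malformed reasoning chains before training; whether the authors apply such a filter is not stated in the abstract.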