Geometry-Guided Camera Motion Understanding in VideoLLMs

Camera motion is a fundamental geometric signal that shapes visual perception and cinematic style, yet current video-capable vision-language models (VideoLLMs) rarely represent it explicitly and often fail on fine-grained motion primitives. We address this gap with a framework of $\textbf{benchmarking}$, $\textbf{diagnosis}$, and $\textbf{injection}$. We curate $\textbf{CameraMotionDataset}$, a large-scale synthetic dataset with explicit camera control, formulate camera motion as constraint-aware multi-label recognition, and construct a VQA benchmark, $\textbf{CameraMotionVQA}$. Across diverse off-the-shelf VideoLLMs, we observe substantial errors in recognizing camera motion primitives. Probing experiments on a Qwen2.5-VL vision encoder suggest that camera motion cues are weakly represented, especially in deeper ViT blocks, which helps explain the observed failure modes. To bridge this gap without costly training or fine-tuning, we propose a lightweight, model-agnostic pipeline that extracts geometric camera cues from 3D foundation models (3DFMs), predicts constrained motion primitives with a temporal classifier, and injects them into downstream VideoLLM inference via structured prompting. Experiments demonstrate improved motion recognition and more camera-aware model responses, highlighting geometry-driven cue extraction and structured prompting as practical steps toward camera-aware VideoLLM and vision-language-action (VLA) systems. The dataset and benchmark are publicly available at https://hf.co/datasets/fengyee/camera-motion-dataset-and-benchmark.
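The last two stages of the pipeline, constrained primitive prediction and structured-prompt injection, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the primitive vocabulary, the mutual-exclusion groups, the 0.5 threshold, and the prompt layout are hypothetical and are not taken from CameraMotionDataset or from the paper's implementation. The scores stand in for the outputs of a temporal classifier over 3DFM-derived camera cues.

```python
from typing import Dict, List

# Hypothetical primitive vocabulary and mutual-exclusion groups (assumptions,
# not the label set used by CameraMotionDataset).
EXCLUSIVE_GROUPS: List[List[str]] = [
    ["pan-left", "pan-right"],
    ["tilt-up", "tilt-down"],
    ["dolly-in", "dolly-out"],
]
STATIC = "static"


def decode_constrained(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Turn per-label scores into a label set that respects the constraints."""
    labels: List[str] = []
    for group in EXCLUSIVE_GROUPS:
        # Within each opposing group, keep at most the highest-scoring label.
        best = max(group, key=lambda name: scores.get(name, 0.0))
        if scores.get(best, 0.0) >= threshold:
            labels.append(best)
    # "static" is emitted only when no motion primitive passes the threshold.
    return labels if labels else [STATIC]


def build_prompt(labels: List[str], question: str) -> str:
    """Prepend the predicted primitives to the question as a structured hint."""
    hint = ", ".join(labels)
    return f"[Camera motion cues: {hint}]\n{question}"


if __name__ == "__main__":
    # Placeholder scores standing in for a temporal classifier's outputs.
    example_scores = {"pan-left": 0.9, "pan-right": 0.7, "tilt-up": 0.2, "dolly-in": 0.6}
    predicted = decode_constrained(example_scores)
    print(build_prompt(predicted, "How does the camera move in this clip?"))
```

In this sketch the constraint is enforced at decoding time, so opposing primitives such as pan-left and pan-right can never be reported together, while compatible ones (a pan combined with a dolly) can. The resulting text hint is prepended to the question without touching the VideoLLM's weights, which is one plausible way to read the training-free design described above.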
