Auteur: Language-Driven Cinematographic Framing for Human-Centric Video Generation

Generative video models have achieved remarkable visual fidelity and temporal coherence, yet intentional camera control remains elusive. Existing frameworks treat camera motion as a byproduct of pixel synthesis, producing trajectories that are stochastic, spatially inconsistent, and indifferent to the human subject driving the scene. In this work, we present Auteur, a method for language-driven, human-centric camera framing in generative video. Our core insight is that professional filmmakers conceive shots not as world-space trajectories but as framings defined relative to the actor, encoding shot size, angle, and composition as functions of human pose and motion. We formalize this intuition as a human-centric camera parameterization and introduce a Domain-Specific Language (DSL) that is convertible to standard 6-DoF camera parameters. A fine-tuned multimodal large language model then acts as a virtual director, mapping natural language descriptions and coarse human motion to sparse DSL keyframes. These keyframes are deterministically interpolated into continuous camera trajectories, which are then provided as input to video generators. We train and evaluate Auteur on a new dataset of 34K aligned samples of text, human motion, and DSL-annotated camera trajectories, drawn from procedural synthesis and from real-world movie footage in the CondensedMovies dataset. Auteur enables cinematographic framing of human-centered scenes, a capability largely absent in prior generative models. To assess this behavior, we propose new framing-focused metrics, and our experiments show that Auteur consistently outperforms existing methods. Project page: https://cyberiada.github.io/Auteur/
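
To make the keyframe-to-trajectory step concrete, the following is a minimal sketch of how a subject-relative framing could be interpolated and converted to a world-space camera pose. It is not the Auteur DSL, which the abstract does not specify. The `Keyframe` fields (`distance`, `azimuth_deg`, `elevation_deg`), the function names, the choice of linear interpolation, and the zero-roll look-at convention are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Keyframe:
    """Hypothetical subject-relative framing; not the paper's DSL."""
    t: float              # time in seconds
    distance: float       # camera-to-subject distance, a proxy for shot size
    azimuth_deg: float    # horizontal angle around the subject (0 = on the +z side)
    elevation_deg: float  # vertical angle (positive = camera above the subject)


def interpolate(keys, t):
    """Piecewise-linear interpolation between sparse keyframes, clamped at the ends.

    Assumes keys are sorted by time. Angles are blended linearly, so this
    sketch does not handle wrap-around at 360 degrees.
    """
    if t <= keys[0].t:
        return keys[0]
    if t >= keys[-1].t:
        return keys[-1]
    for k0, k1 in zip(keys, keys[1:]):
        if k0.t <= t <= k1.t:
            u = (t - k0.t) / (k1.t - k0.t)
            mix = lambda a, b: a + (b - a) * u
            return Keyframe(
                t,
                mix(k0.distance, k1.distance),
                mix(k0.azimuth_deg, k1.azimuth_deg),
                mix(k0.elevation_deg, k1.elevation_deg),
            )
    raise ValueError("keyframes must be sorted by time")


def camera_pose(frame, subject):
    """Convert a subject-relative framing to a world-space position and view direction.

    With roll assumed to be zero, the position plus the unit forward vector
    determines a full 6-DoF pose. Requires frame.distance > 0.
    """
    az = math.radians(frame.azimuth_deg)
    el = math.radians(frame.elevation_deg)
    d = frame.distance
    # Spherical offset from the subject to the camera (y is up).
    offset = (
        d * math.cos(el) * math.sin(az),
        d * math.sin(el),
        d * math.cos(el) * math.cos(az),
    )
    position = tuple(s + o for s, o in zip(subject, offset))
    forward = tuple(-o / d for o in offset)  # camera looks back at the subject
    return position, forward


# Two sparse keyframes: a wider frontal shot moving to a closer side shot.
keys = [Keyframe(0.0, 4.0, 0.0, 0.0), Keyframe(2.0, 2.0, 90.0, 0.0)]
subject = (0.0, 1.0, 0.0)  # e.g. a point on the actor's torso, taken from the motion input
mid = interpolate(keys, 1.0)
position, forward = camera_pose(mid, subject)
```

Because the framing is stored relative to the subject, the same keyframes yield a different world-space trajectory whenever the subject position changes, which is the property the human-centric parameterization is meant to provide.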
