We introduce FaceCam, a system that generates video under customizable camera trajectories from a monocular human portrait video. Recent camera-control approaches based on large video-generation models have shown promising progress, but they often exhibit geometric distortions and visual artifacts on portrait videos due to scale-ambiguous camera representations or 3D reconstruction errors. To overcome these limitations, we propose a face-tailored, scale-aware representation of camera transformations that provides deterministic conditioning without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies, synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, continuous camera trajectories at inference time. Experiments on the Ava-256 dataset and diverse in-the-wild videos demonstrate that FaceCam achieves superior camera controllability, visual quality, and identity and motion preservation.