Learning Camera Movement Control from Real-World Drone Videos

This study seeks to automate camera movement control for filming existing subjects into attractive videos, in contrast to creating non-existent content by directly generating pixels. We select drone videos as our test case due to their rich and challenging motion patterns, distinctive viewing angles, and precise controls. Existing AI videography methods struggle with limited appearance diversity in simulation training, high costs of recording expert operations, and difficulties in designing heuristic-based goals that cover all scenarios. To avoid these issues, we propose a scalable method that involves collecting real-world training data to improve diversity, extracting camera trajectories automatically to minimize annotation costs, and training an effective architecture that does not rely on heuristics. Specifically, we collect 99k high-quality trajectories by running 3D reconstruction on online videos, connecting camera poses from consecutive frames to form 3D camera paths, and using a Kalman filter to identify and remove low-quality data. Moreover, we introduce DVGFormer, an auto-regressive transformer that leverages the camera path and images from all past frames to predict the camera movement for the next frame. We evaluate our system across 38 synthetic natural scenes and 7 real city 3D scans. We show that our system effectively learns to perform challenging camera movements such as navigating through obstacles, maintaining low altitude to increase perceived speed, and orbiting towers and buildings, all of which are useful for recording high-quality videos. Data and code are available at dvgformer.github.io.
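To make the Kalman-filter data-cleaning step more concrete, the following is a minimal sketch of one way such a check could work. It is not the authors' implementation. It assumes a constant-velocity motion model, flags a reconstructed camera path when the distance between the predicted and observed position exceeds a threshold, and uses hypothetical names (`kalman_residuals`, `is_low_quality`) and illustrative noise and threshold values that do not come from the paper.

```python
import numpy as np


def kalman_residuals(positions, dt=1.0, process_var=1e-2, meas_var=1e-2):
    """Run a constant-velocity Kalman filter over an (N, 3) camera path.

    Returns the per-frame innovation norm, i.e. the distance between the
    position predicted from past frames and the reconstructed position.
    The motion model and noise values are assumptions for illustration.
    """
    positions = np.asarray(positions, dtype=float)
    n, dim = positions.shape

    # State vector is [position, velocity]; position advances by velocity * dt.
    F = np.eye(2 * dim)
    F[:dim, dim:] = dt * np.eye(dim)
    H = np.hstack([np.eye(dim), np.zeros((dim, dim))])  # only position is observed
    Q = process_var * np.eye(2 * dim)  # process noise covariance
    R = meas_var * np.eye(dim)         # measurement noise covariance

    x = np.concatenate([positions[0], np.zeros(dim)])
    P = np.eye(2 * dim)
    residuals = np.zeros(n)

    for k in range(1, n):
        # Predict the next state from the motion model.
        x = F @ x
        P = F @ P @ F.T + Q
        # Compare the prediction with the reconstructed camera position.
        innovation = positions[k] - H @ x
        residuals[k] = np.linalg.norm(innovation)
        # Standard Kalman update.
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ innovation
        P = (np.eye(2 * dim) - K @ H) @ P

    return residuals


def is_low_quality(positions, threshold=1.0, **kwargs):
    """Flag a path whose largest prediction error exceeds the (assumed) threshold."""
    return bool(kalman_residuals(positions, **kwargs).max() > threshold)


# A smooth straight-line path versus the same path with a sudden jump.
t = np.arange(50, dtype=float)[:, None]
smooth_path = t * np.array([0.1, 0.0, 0.0])
jumpy_path = smooth_path.copy()
jumpy_path[25:] += np.array([5.0, 0.0, 0.0])

print(is_low_quality(smooth_path), is_low_quality(jumpy_path))
```

In this sketch, a path that moves smoothly stays close to the filter's prediction, while a path with an abrupt jump (for example, from a reconstruction failure or a scene cut) produces a large prediction error and would be discarded. A real pipeline would also need to handle camera orientation and scale ambiguity, which this example ignores.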
