Recent advances in text-conditioned video diffusion have greatly improved video quality. However, these methods offer users limited or no control over camera aspects, including dynamic camera motion, zoom, lens distortion, and focus shifts. These motion and optical aspects are crucial for adding controllability and cinematic elements to generation frameworks, ultimately yielding visual content that draws focus, enhances mood, and guides emotions according to the filmmaker's intent. In this paper, we aim to close the gap between controllable video generation and camera optics. To this end, we propose AKiRa (Augmentation Kit on Rays), a novel augmentation framework that builds and trains a camera adapter with a complex camera model over an existing video generation backbone. It enables fine-grained control over camera motion as well as complex optical parameters (focal length, distortion, aperture) to achieve cinematic effects such as zoom, fisheye, and bokeh. Extensive experiments demonstrate AKiRa's effectiveness in combining and composing camera optical effects while outperforming state-of-the-art methods. This work sets a new landmark in controllable, optically enhanced video generation, paving the way for future optical video generation methods.