Recent advances in text-conditioned video diffusion have greatly improved video quality. However, these methods give users limited, and sometimes no, control over camera aspects, including dynamic camera motion, zoom, lens distortion, and focus shifts. These motion and optical aspects are crucial for adding controllability and cinematic elements to generation frameworks, ultimately producing visual content that draws focus, enhances mood, and guides emotion under the filmmaker's control. In this paper, we aim to close the gap between controllable video generation and camera optics. To this end, we propose AKiRa (Augmentation Kit on Rays), a novel augmentation framework that builds and trains a camera adapter with a complex camera model on top of an existing video generation backbone. It enables fine-grained control over camera motion as well as complex optical parameters (focal length, distortion, aperture) to achieve cinematic effects such as zoom, fisheye, and bokeh. Extensive experiments demonstrate AKiRa's effectiveness in combining and composing camera optics while outperforming all state-of-the-art methods. This work sets a new landmark in controlled and optically enhanced video generation, paving the way for future optical video generation methods.