Recent advances in diffusion models have significantly improved the quality of video generation. However, fine-grained control over camera pose remains a challenge. While U-Net-based models have shown promising results for camera control, transformer-based diffusion models (DiT), the preferred architecture for large-scale video generation, suffer from severe degradation in camera motion accuracy. In this paper, we investigate the underlying causes of this issue and propose solutions tailored to DiT architectures. Our study reveals that camera control performance depends heavily on the choice of conditioning method rather than on the camera pose representation, as is commonly believed. To address the persistent motion degradation in DiT, we introduce Camera Motion Guidance (CMG), based on classifier-free guidance, which improves camera control accuracy by over 400%. Additionally, we present a sparse camera control pipeline that significantly simplifies the specification of camera poses for long videos. Our method applies universally to both U-Net and DiT models, offering improved camera control for video generation tasks.
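Since CMG builds on classifier-free guidance, the following is a minimal sketch of how such guidance combines denoiser predictions. The standard classifier-free guidance rule is well established; the separate camera-conditioned term shown here is a hypothetical illustration of the general idea, not the paper's exact formulation, and the function names are invented for this sketch.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: push the denoiser output
    from the unconditional prediction toward the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def camera_motion_guidance(eps_uncond, eps_text, eps_text_cam,
                           text_scale, cam_scale):
    """Hypothetical CMG-style variant (illustrative only): apply one
    guidance term for the text condition and a second, separately
    weighted term for the camera-pose condition."""
    return (eps_uncond
            + text_scale * (eps_text - eps_uncond)
            + cam_scale * (eps_text_cam - eps_text))
```

With `cam_scale` above 1, the camera-conditioned direction is amplified relative to the text-only prediction, which is the general mechanism by which guidance can strengthen adherence to a camera condition at sampling time.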