Video generation primarily aims to model authentic and customized motion across frames, making the understanding and control of motion a crucial topic. Most diffusion-based studies of video motion focus on motion customization through training-based paradigms, which demand substantial training resources and require retraining for each new model. Crucially, these approaches do not examine how video diffusion models encode cross-frame motion information in their features, leaving their effectiveness opaque and hard to interpret. To address this gap, this paper introduces a novel perspective for understanding, localizing, and manipulating motion-aware features in video diffusion models. Through analysis with Principal Component Analysis (PCA), we show that robust motion-aware features already exist in video diffusion models. We present a new MOtion FeaTure (MOFT), obtained by eliminating content-correlation information and filtering motion channels. MOFT offers a distinct set of benefits: it encodes comprehensive motion information with clear interpretability, can be extracted without training, and generalizes across diverse architectures. Leveraging MOFT, we propose a novel training-free video motion control framework. Our method achieves competitive performance in generating natural and faithful motion, providing architecture-agnostic insights and applicability to a variety of downstream tasks.
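The two steps named in the abstract, removing content correlation and filtering motion channels, can be illustrated with a minimal sketch. This is a hypothetical approximation, not the paper's actual MOFT construction: the feature tensor shape, the use of temporal mean-subtraction for content removal, and the variance-based channel ranking are all assumptions made for illustration.

```python
import numpy as np

def moft_sketch(feats, k=8):
    """Hypothetical MOFT-style extraction (illustrative assumption,
    not the paper's exact procedure).

    feats: (F, C, H, W) intermediate diffusion features over F frames.
    Step 1: remove content correlation by subtracting the per-pixel
            temporal mean, leaving only cross-frame variation.
    Step 2: rank channels by their cross-frame variance (a cheap proxy
            for the PCA-based analysis) and keep the top-k as
            "motion channels".
    """
    # Step 1: content removal via temporal mean-subtraction.
    debiased = feats - feats.mean(axis=0, keepdims=True)
    # Step 2: score each channel by its variance across frames and space.
    scores = debiased.var(axis=(0, 2, 3))
    top = np.argsort(scores)[-k:]
    # Return the motion-aware feature restricted to the selected channels.
    return debiased[:, top], top
```

By construction, the returned feature averages to zero over frames, so any remaining signal reflects cross-frame change rather than static content.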