Video generation primarily aims to model authentic and customized motion across frames, making the understanding and control of motion a crucial topic. Most diffusion-based studies on video motion focus on motion customization with training-based paradigms, which, however, demand substantial training resources and necessitate retraining for diverse models. Crucially, these approaches do not explore how video diffusion models encode cross-frame motion information in their features, leaving their effectiveness without interpretability or transparency. To address this gap, this paper introduces a novel perspective for understanding, localizing, and manipulating motion-aware features in video diffusion models. Through analysis with Principal Component Analysis (PCA), our work discloses that robust motion-aware features already exist in video diffusion models. We present a new MOtion FeaTure (MOFT), obtained by eliminating content-correlation information and filtering motion channels. MOFT provides a distinct set of benefits: it encodes comprehensive motion information with clear interpretability, can be extracted without training, and generalizes across diverse architectures. Leveraging MOFT, we propose a novel training-free video motion control framework. Our method demonstrates competitive performance in generating natural and faithful motion, providing architecture-agnostic insights and applicability to a variety of downstream tasks.
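The two MOFT steps named above (eliminating content correlation, then filtering motion channels) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `extract_moft`, the parameter `k`, and the concrete choice of "subtract the per-location temporal mean, then keep the channels with the most cross-frame variance" (a PCA-style energy criterion) are all assumptions made here for exposition.

```python
import numpy as np

def extract_moft(features, k=32):
    """Sketch of the MOFT idea on intermediate diffusion features.

    features: array of shape (F, C, H, W) -- F frames of C-channel activations.
    k: number of motion-aware channels to keep (hypothetical parameter).
    """
    # 1) Eliminate content correlation: subtract the per-location mean over
    #    frames, so only cross-frame (motion) variation remains.
    motion = features - features.mean(axis=0, keepdims=True)

    # 2) Filter motion channels: rank channels by the cross-frame variance
    #    they carry and keep the top-k (a PCA-style energy criterion).
    channel_energy = motion.var(axis=(0, 2, 3))   # shape (C,)
    top = np.argsort(channel_energy)[::-1][:k]
    return motion[:, top]                         # shape (F, k, H, W)

# Toy usage with random features standing in for real diffusion activations.
feats = np.random.randn(8, 64, 16, 16).astype(np.float32)
moft = extract_moft(feats, k=32)
print(moft.shape)  # (8, 32, 16, 16)
```

Because the temporal mean is removed before channel selection, the resulting feature is content-free by construction: a static scene yields an all-zero MOFT, while any surviving signal reflects cross-frame change.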