Reality is a dance between rigid constraints and deformable structures. For video models, that means generating motion that preserves fidelity as well as structure. Despite progress in diffusion models, producing realistic structure-preserving motion remains challenging, especially for articulated and deformable objects such as humans and animals. Scaling training data alone has so far failed to resolve physically implausible transitions. Existing approaches condition on noisy motion representations, such as optical flow or skeletons extracted by an imperfect external model. To address these challenges, we introduce an algorithm to distill structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). With our method, we train SAM2VideoX, which contains two innovations: (1) a bidirectional feature fusion module that extracts global structure-preserving motion priors from a recurrent model like SAM2; (2) a Local Gram Flow loss that aligns how local features move together. Experiments on VBench and in human studies show that SAM2VideoX delivers consistent gains over prior baselines (+2.60 percentage points on VBench, 21 to 22% lower FVD, and 71.4% human preference). Specifically, on VBench, we achieve 95.51%, surpassing REPA (92.91%) by 2.60 percentage points, and reduce FVD to 360.57, which is 21.20% and 22.46% lower than REPA and LoRA fine-tuning, respectively. The project website can be found at https://sam2videox.github.io/.
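The reported numbers mix two kinds of comparison: the VBench gain is an absolute difference in percentage points, while the FVD gains are relative reductions. The short sketch below reproduces that arithmetic from the figures quoted in the abstract. The function names are hypothetical, and the baseline FVD values it prints are only implied by the stated percentages (assuming each percentage is measured relative to the baseline's FVD); they are not reported in the abstract.

```python
def point_gain(new_score, old_score):
    """Absolute difference between two percentage scores, in percentage points."""
    return new_score - old_score


def relative_reduction(new_value, old_value):
    """Fractional reduction of new_value relative to old_value (lower is better)."""
    return (old_value - new_value) / old_value


def implied_baseline(new_value, reduction):
    """Invert relative_reduction: the baseline value that yields the given reduction."""
    return new_value / (1.0 - reduction)


# Figures quoted in the abstract.
vbench_ours, vbench_repa = 95.51, 92.91
fvd_ours = 360.57

vbench_gain = point_gain(vbench_ours, vbench_repa)

# Assumption: the 21.20% and 22.46% figures are relative to each baseline's FVD.
# The resulting baselines are derived here, not taken from the abstract.
repa_fvd = implied_baseline(fvd_ours, 0.2120)
lora_fvd = implied_baseline(fvd_ours, 0.2246)

print(f"VBench gain over REPA: {vbench_gain:.2f} percentage points")
print(f"Implied REPA FVD: {repa_fvd:.1f}")
print(f"Implied LoRA FVD: {lora_fvd:.1f}")
```

Under this reading, the "21 to 22% lower FVD" summary and the two specific percentages describe the same measurements, just rounded differently.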