Current instruction-guided video editing models struggle to balance precise semantic modifications with faithful motion preservation. Existing approaches mitigate this trade-off by injecting explicit external priors (e.g., VLM features or structural conditions), but this reliance severely limits model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, Semantic Anchoring establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), so that the model internalizes temporal dynamics directly from raw videos. SAMA is trained with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Notably, the factorized pre-training alone already yields strong zero-shot video editing ability, which validates the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.
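To make the three motion-centric pretext tasks concrete, the following is a minimal NumPy sketch of how such corruptions might be applied to a raw clip before asking a model to restore it. The abstract does not specify the exact formulations, so everything here is an assumption for illustration: the function names `cube_mask`, `speed_perturb`, and `tube_shuffle`, the `(T, H, W, C)` array layout, the nearest-frame resampling, and the reading of "tube shuffle" as a permutation of fixed-length temporal segments are all hypothetical and not taken from the SAMA implementation.

```python
import numpy as np


def cube_mask(video, t0, y0, x0, size, fill=0.0):
    """Blank out one spatio-temporal cube (assumed form of cube inpainting)."""
    t, h, w = size
    corrupted = video.copy()  # leave the clean clip untouched as the target
    corrupted[t0:t0 + t, y0:y0 + h, x0:x0 + w] = fill
    return corrupted


def speed_perturb(video, factor):
    """Resample frames at a different playback speed (assumed form).

    Uses nearest-lower-frame sampling and clamps to the last frame, so the
    output keeps the same number of frames as the input.
    """
    n = video.shape[0]
    idx = np.floor(np.arange(n) * factor).astype(int)
    idx = np.clip(idx, 0, n - 1)
    return video[idx]


def tube_shuffle(video, tube_len, rng):
    """Permute fixed-length temporal segments (assumed form of tube shuffle).

    Returns the shuffled clip and the permutation, which could serve as the
    restoration target. Trailing frames that do not fill a segment stay put.
    """
    n = video.shape[0]
    n_tubes = n // tube_len
    usable = n_tubes * tube_len
    order = rng.permutation(n_tubes)
    tubes = video[:usable].reshape(n_tubes, tube_len, *video.shape[1:])
    shuffled = tubes[order].reshape(usable, *video.shape[1:])
    return np.concatenate([shuffled, video[usable:]], axis=0), order


# Toy clip: 8 frames of 4x4 RGB, filled with distinct values.
video = np.arange(8 * 4 * 4 * 3, dtype=np.float32).reshape(8, 4, 4, 3)
masked = cube_mask(video, t0=2, y0=1, x0=1, size=(3, 2, 2))
fast = speed_perturb(video, factor=2.0)
shuffled, order = tube_shuffle(video, tube_len=2, rng=np.random.default_rng(0))
```

In each case the clean `video` would act as the restoration target and the corrupted clip as the input, which is why such tasks need no paired editing data.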