First-Frame Propagation (FFP) offers a promising paradigm for controllable video editing, but existing methods are hampered by a reliance on cumbersome run-time guidance. We identify the root cause of this limitation as the inadequacy of current training datasets, which are often too short, low-resolution, and lacking in the task diversity required to teach robust temporal priors. To address this foundational data gap, we first introduce FFP-300K, a new large-scale dataset comprising 300K high-fidelity video pairs at 720p resolution and 81 frames in length, constructed via a principled two-track pipeline covering diverse local and global edits. Building on this dataset, we propose a novel framework designed for truly guidance-free FFP that resolves the critical tension between maintaining first-frame appearance and preserving source video motion. Architecturally, we introduce Adaptive Spatio-Temporal RoPE (AST-RoPE), which dynamically remaps positional encodings to disentangle appearance and motion references. At the objective level, we employ a self-distillation strategy in which an identity propagation task acts as a powerful regularizer, ensuring long-term temporal stability and preventing semantic drift. Comprehensive experiments on the EditVerseBench benchmark demonstrate that our method significantly outperforms existing academic and commercial models, achieving improvements of approximately 0.2 in PickScore and 0.3 in VLM score over these competitors.
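To make the positional-remapping idea behind AST-RoPE concrete, the sketch below shows one plausible way to assign an edited first frame a detached temporal index so it acts as an appearance anchor while the source frames keep a contiguous motion timeline. This is a minimal, hypothetical illustration; the function names (build_rope_angles, apply_rope, remap_temporal_positions) and the fixed reference offset are assumptions for exposition, not the paper's actual AST-RoPE implementation.

```python
# Hypothetical sketch: remap temporal RoPE indices so the edited first frame
# serves as an appearance reference while source frames encode motion.
import numpy as np

def build_rope_angles(positions: np.ndarray, dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE angles: one rotation frequency per pair of channels."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)       # (dim/2,)
    return positions[:, None] * freqs[None, :]          # (num_tokens, dim/2)

def apply_rope(x: np.ndarray, angles: np.ndarray) -> np.ndarray:
    """Rotate channel pairs of x (num_tokens, dim) by the given angles."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def remap_temporal_positions(num_frames: int, ref_offset: float = -8.0) -> np.ndarray:
    """Assumed remapping: the edited first frame gets a detached index
    (ref_offset), while source frames keep indices 0..num_frames-1."""
    positions = np.arange(num_frames, dtype=np.float32)
    return np.concatenate([[ref_offset], positions])    # prepend reference slot

# Usage: 81 source frames plus 1 reference frame, 64-dim per-frame feature.
dim, num_frames = 64, 81
feats = np.random.randn(num_frames + 1, dim).astype(np.float32)
angles = build_rope_angles(remap_temporal_positions(num_frames), dim)
encoded = apply_rope(feats, angles)
print(encoded.shape)  # (82, 64)
```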