Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to synthesize reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction- and reference-following editing, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries with latent visual features for reference-based semantic guidance. Trained with a progressive multi-stage curriculum, our model achieves significant gains in both instruction following and reference fidelity. Extensive experiments demonstrate that our data and architecture establish a new state of the art in controllable video editing. All datasets, models, and code are released at https://github.com/showlab/Kiwi-Edit.