Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.