Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

In recent years, open-source efforts like Senorita-2M have propelled video editing toward natural-language instructions. However, current publicly available datasets predominantly focus on local editing or style transfer, which largely preserve the original scene structure and are easier to scale. In contrast, Background Replacement, a task central to creative applications such as film production and advertising, requires synthesizing entirely new, temporally consistent scenes while maintaining accurate foreground-background interactions, making large-scale data generation significantly more challenging. Consequently, this complex task remains largely underexplored due to a scarcity of high-quality training data. This gap is evident in the weak performance of state-of-the-art models such as Kiwi-Edit, because the primary open-source dataset that covers this task, OpenVE-3M, frequently exhibits static, unnatural backgrounds. In this paper, we trace this quality degradation to a lack of precise background guidance during data synthesis. Accordingly, we design a scalable pipeline that generates foreground and background guidance in a decoupled manner and applies strict quality filtering. Building on this pipeline, we introduce Sparkle, a dataset of ~140K video pairs spanning five common background-change themes, alongside Sparkle-Bench, the largest evaluation benchmark tailored for background replacement to date. Experiments demonstrate that the model trained on our dataset performs substantially better than all existing baselines on both OpenVE-Bench and Sparkle-Bench. Our proposed dataset, benchmark, and model are fully open-sourced at https://showlab.github.io/Sparkle/.

