We introduce a new task, language-driven video inpainting, which uses natural language instructions to guide the inpainting process. This approach overcomes a key limitation of traditional video inpainting methods: their dependence on manually labeled binary masks, which are tedious and labor-intensive to produce. To support training and evaluation for this task, we present the Remove Objects from Videos by Instructions (ROVI) dataset, containing 5,650 videos and 9,091 inpainting results. We also propose a novel diffusion-based language-driven video inpainting framework, the first end-to-end baseline for this task, which integrates Multimodal Large Language Models to understand and execute complex language-based inpainting requests. Our experimental results demonstrate the dataset's versatility and the model's effectiveness across a variety of language-instructed inpainting scenarios. We will make the dataset, code, and models publicly available.