We introduce a new task -- language-driven video inpainting, which uses natural language instructions to guide the inpainting process. This approach overcomes a key limitation of traditional video inpainting methods, which depend on manually labeled binary masks, a process that is often tedious and labor-intensive. To support training and evaluation for this task, we present the Remove Objects from Videos by Instructions (ROVI) dataset, containing 5,650 videos and 9,091 inpainting results. We also propose a novel diffusion-based language-driven video inpainting framework, the first end-to-end baseline for this task, which integrates Multimodal Large Language Models to understand and execute complex language-based inpainting requests effectively. Comprehensive experiments demonstrate the dataset's versatility and the model's effectiveness across diverse language-instructed inpainting scenarios. We will make the dataset, code, and models publicly available.