Recent AI-based video editing has enabled users to edit videos through simple text prompts, significantly simplifying the editing process. However, existing zero-shot video editing techniques primarily focus on global or single-object edits, which can lead to unintended changes in other parts of the video. When multiple objects require localized edits, these methods struggle with unfaithful editing and editing leakage, and the field lacks suitable evaluation datasets and metrics. To overcome these limitations, we propose a zero-shot $\textbf{M}$ulti-$\textbf{I}$nstance $\textbf{V}$ideo $\textbf{E}$diting framework, called MIVE. MIVE is a general-purpose mask-based framework that is not dedicated to specific objects (e.g., people). MIVE introduces two key modules: (i) Disentangled Multi-instance Sampling (DMS) to prevent editing leakage and (ii) Instance-centric Probability Redistribution (IPR) to ensure precise localization and faithful editing. Additionally, we present our new MIVE Dataset featuring diverse video scenarios and introduce the Cross-Instance Accuracy (CIA) Score to evaluate editing leakage in multi-instance video editing tasks. Our extensive qualitative, quantitative, and user study evaluations demonstrate that MIVE significantly outperforms recent state-of-the-art methods in terms of editing faithfulness, accuracy, and leakage prevention, setting a new benchmark for multi-instance video editing. The project page is available at https://kaist-viclab.github.io/mive-site/
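To make the notion of editing leakage concrete, the following is a minimal sketch of the kind of cross-instance check the CIA Score targets: given a similarity matrix whose entry (i, j) scores how well edited instance i's masked region matches instance j's target caption (e.g., via CLIP), an instance is counted as leakage-free when its own caption scores highest in its row. The matrix construction, the row-argmax rule, and the function name here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cross_instance_accuracy(sim: np.ndarray) -> float:
    """Toy cross-instance accuracy.

    sim[i, j] is the similarity (e.g., CLIP score) between edited instance i's
    masked region and instance j's target caption. An instance is counted as
    leakage-free when its own caption (the diagonal entry) is the row maximum.
    """
    assert sim.ndim == 2 and sim.shape[0] == sim.shape[1], "square matrix expected"
    correct = sim.argmax(axis=1) == np.arange(sim.shape[0])
    return float(correct.mean())

# Example: 3 edited instances; instance 2 leaked (its region matches caption 0 best).
sim = np.array([
    [0.31, 0.22, 0.18],
    [0.20, 0.29, 0.21],
    [0.30, 0.19, 0.24],
])
print(cross_instance_accuracy(sim))  # ~0.667
```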