Text-driven image and video diffusion models have achieved unprecedented success in generating realistic and diverse content. Recently, editing and creating variations of existing images and videos with diffusion-based generative models has garnered significant attention. However, previous works are limited to editing content with text or providing coarse personalization using a single visual clue, which makes them unsuitable for content that is hard to describe in words and requires fine-grained, detailed control. To address this, we propose a generic video editing framework called Make-A-Protagonist, which uses textual and visual clues to edit videos, with the goal of empowering individuals to become the protagonists. Specifically, we leverage multiple experts to parse the source video and the target visual and textual clues, and we propose a visual-textual video generation model that employs mask-guided denoising sampling to generate the desired output. Extensive results demonstrate the versatile and remarkable editing capabilities of Make-A-Protagonist.
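The abstract names mask-guided denoising sampling without giving details, so the following is only a minimal sketch of one common way such masking is done, not the authors' implementation. It assumes a generic latent-blending scheme: at each step the denoised latent is kept inside a protagonist mask, and a latent from the source video is restored outside it. The function `toy_denoise_step`, the `source_trajectory` array, and all shapes are hypothetical placeholders introduced for illustration.

```python
import numpy as np


def toy_denoise_step(latent):
    """Hypothetical stand-in for one denoising update of a diffusion model.

    A real model would predict and remove noise conditioned on the textual
    and visual clues; here the latent is simply shrunk toward zero.
    """
    return 0.9 * latent


def mask_guided_sampling(source_trajectory, mask, seed=0):
    """Blend edited and source latents under a mask at every step.

    source_trajectory: array of shape (steps, frames, height, width) holding
        one source-video latent per denoising step (assumed to be given).
    mask: array of shape (frames, height, width); 1 marks the region to
        edit, 0 marks the region to preserve from the source.
    """
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(source_trajectory.shape[1:])
    for source_latent in source_trajectory:
        edited = toy_denoise_step(latent)
        # Keep the edit inside the mask, restore the source outside it.
        latent = mask * edited + (1.0 - mask) * source_latent
    return latent


if __name__ == "__main__":
    frames, height, width = 2, 4, 4
    trajectory = np.ones((5, frames, height, width))
    region = np.zeros((frames, height, width))
    region[:, 1:3, 1:3] = 1.0  # edit only the central patch of each frame
    result = mask_guided_sampling(trajectory, region)
    print(result.shape)
```

In this sketch the region outside the mask ends up identical to the final source latent, while the region inside it is determined only by the denoising updates. That is the property a mask-guided scheme of this kind relies on to change the protagonist while preserving the rest of the video.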