Recent innovations and breakthroughs in diffusion models have significantly expanded the possibilities for generating high-quality videos from a given prompt. Most existing work addresses the single-scene scenario, in which a single video event occurs against a single background. Extending this to multi-scene video generation is nevertheless non-trivial: it requires managing the logic between scenes while preserving a consistent visual appearance of key content across them. In this paper, we propose a novel framework, namely VideoDrafter, for content-consistent multi-scene video generation. Technically, VideoDrafter leverages a Large Language Model (LLM) to convert the input prompt into a comprehensive multi-scene script that benefits from the logical knowledge learned by the LLM. The script for each scene includes a prompt describing the event, the foreground/background entities, and the camera movement. VideoDrafter identifies the common entities throughout the script and asks the LLM to describe each entity in detail. The resulting entity description is then fed into a text-to-image model to generate a reference image for each entity. Finally, VideoDrafter outputs a multi-scene video by generating each scene video via a diffusion process that takes the reference images, the descriptive prompt of the event, and the camera movement into account. The diffusion model incorporates the reference images as the condition and alignment to strengthen the content consistency of multi-scene videos. Extensive experiments demonstrate that VideoDrafter outperforms state-of-the-art (SOTA) video generation models in terms of visual quality, content consistency, and user preference.
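The pipeline described in the abstract can be sketched in a few lines of Python. This is only an illustrative outline under stated assumptions, not the authors' implementation: the `SceneScript` field names, the `common_entities` and `draft_video` helpers, and the four callables (`write_script`, `describe_entity`, `text_to_image`, `generate_scene`) are hypothetical stand-ins for the LLM, the text-to-image model, and the video diffusion model. The sketch also assumes that a "common" entity is one that appears in at least two scenes, which the abstract does not specify.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class SceneScript:
    """One scene of the LLM-written script (field names are hypothetical)."""
    event_prompt: str       # prompt describing the event
    foreground: List[str]   # foreground entity names
    background: List[str]   # background entity names
    camera_movement: str    # e.g. "static", "zoom in"


def common_entities(script: List[SceneScript], min_scenes: int = 2) -> List[str]:
    """Return entities appearing in at least `min_scenes` scenes.

    Assumption: "common" means shared by two or more scenes.
    """
    counts: Counter = Counter()
    for scene in script:
        # Count each entity at most once per scene.
        counts.update(set(scene.foreground) | set(scene.background))
    return sorted(name for name, n in counts.items() if n >= min_scenes)


def draft_video(
    prompt: str,
    write_script: Callable[[str], List[SceneScript]],    # stand-in for the LLM
    describe_entity: Callable[[str], str],               # stand-in for the LLM
    text_to_image: Callable[[str], Any],                 # stand-in for the T2I model
    generate_scene: Callable[[str, str, Dict[str, Any]], Any],  # stand-in for video diffusion
) -> List[Any]:
    """Run the three stages: script, reference images, per-scene generation."""
    # Stage 1: the LLM turns the input prompt into a multi-scene script.
    script = write_script(prompt)

    # Stage 2: one reference image per common entity, from a detailed description.
    references = {
        name: text_to_image(describe_entity(name))
        for name in common_entities(script)
    }

    # Stage 3: each scene is generated from its event prompt, camera movement,
    # and the reference images of the entities it contains.
    clips = []
    for scene in script:
        names = scene.foreground + scene.background
        scene_refs = {n: references[n] for n in names if n in references}
        clips.append(generate_scene(scene.event_prompt, scene.camera_movement, scene_refs))
    return clips
```

Passing the models in as callables keeps the sketch independent of any particular LLM or diffusion library; in practice each callable would wrap a real model call, and the returned clips would be concatenated into the final multi-scene video.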