Recent innovations and breakthroughs in diffusion models have significantly expanded the possibilities of generating high-quality videos from given prompts. Most existing works tackle the single-scene scenario, in which only one video event occurs against a single background. Extending this to multi-scene video generation is nevertheless non-trivial: it requires carefully managing the logic across scenes while preserving the consistent visual appearance of key content throughout. In this paper, we propose a novel framework, namely VideoStudio, for consistent-content, multi-scene video generation. Technically, VideoStudio leverages a Large Language Model (LLM) to convert the input prompt into a comprehensive multi-scene script, benefiting from the logical knowledge learned by the LLM. The script for each scene includes a prompt describing the event, the foreground/background entities, and the camera movement. VideoStudio identifies the entities common across the script and asks the LLM to detail each of them. Each resultant entity description is then fed into a text-to-image model to generate a reference image for that entity. Finally, VideoStudio outputs a multi-scene video by generating each scene via a diffusion process that takes the reference images, the descriptive prompt of the event, and the camera movement into account. The diffusion model incorporates the reference images as condition and alignment signals to strengthen the content consistency of multi-scene videos. Extensive experiments demonstrate that VideoStudio outperforms state-of-the-art video generation models in terms of visual quality, content consistency, and user preference. Source code is available at \url{https://github.com/FuchenUSTC/VideoStudio}.
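The pipeline described above can be sketched as orchestration code. This is a minimal illustration only: the model calls (`llm_write_script`, `llm_describe_entity`, `t2i_reference_image`, `diffuse_scene`) are hypothetical stubs standing in for the LLM, the text-to-image model, and the per-scene video diffusion model, and are not the paper's actual API.

```python
from dataclasses import dataclass

@dataclass
class Scene:
    event_prompt: str       # description of the event in this scene
    entities: list          # foreground/background entity names
    camera_movement: str    # e.g. "pan left", "zoom in"

def llm_write_script(prompt, num_scenes=3):
    """Stub: the LLM expands the input prompt into a multi-scene script."""
    return [Scene(f"{prompt} (scene {i + 1})",
                  ["protagonist", "background"],
                  "static") for i in range(num_scenes)]

def llm_describe_entity(entity):
    """Stub: the LLM details a common entity found across scenes."""
    return f"a detailed description of {entity}"

def t2i_reference_image(description):
    """Stub: a text-to-image model renders a reference image."""
    return {"ref_image_for": description}

def diffuse_scene(scene, ref_images):
    """Stub: per-scene video diffusion conditioned on the reference
    images, the event prompt, and the camera movement."""
    return {"prompt": scene.event_prompt,
            "camera": scene.camera_movement,
            "refs": ref_images}

def generate_multi_scene_video(prompt):
    # 1) LLM converts the prompt into a multi-scene script.
    script = llm_write_script(prompt)
    # 2) Entities appearing in every scene are the common content whose
    #    visual appearance must stay consistent across scenes.
    common = set(script[0].entities)
    for scene in script[1:]:
        common &= set(scene.entities)
    # 3) LLM details each common entity; a T2I model renders a
    #    reference image per entity.
    refs = {e: t2i_reference_image(llm_describe_entity(e))
            for e in sorted(common)}
    # 4) Each scene is generated by diffusion conditioned on the shared
    #    reference images, then the clips form the multi-scene video.
    return [diffuse_scene(scene, refs) for scene in script]
```

Sharing the same reference images across all scene-level diffusion calls is what ties the scenes together: every clip is conditioned on identical depictions of the common entities.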