VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM

The recent innovations and breakthroughs in diffusion models have significantly expanded the possibilities of generating high-quality videos for the given prompts. Most existing works tackle the single-scene scenario with only one video event occurring in a single background. Extending to generate multi-scene videos nevertheless is not trivial and necessitates to nicely manage the logic in between while preserving the consistent visual appearance of key content across video scenes. In this paper, we propose a novel framework, namely VideoDrafter, for content-consistent multi-scene video generation. Technically, VideoDrafter leverages Large Language Models (LLM) to convert the input prompt into comprehensive multi-scene script that benefits from the logical knowledge learnt by LLM. The script for each scene includes a prompt describing the event, the foreground/background entities, as well as camera movement. VideoDrafter identifies the common entities throughout the script and asks LLM to detail each entity. The resultant entity description is then fed into a text-to-image model to generate a reference image for each entity. Finally, VideoDrafter outputs a multi-scene video by generating each scene video via a diffusion process that takes the reference images, the descriptive prompt of the event and camera movement into account. The diffusion model incorporates the reference images as the condition and alignment to strengthen the content consistency of multi-scene videos. Extensive experiments demonstrate that VideoDrafter outperforms the SOTA video generation models in terms of visual quality, content consistency, and user preference.

翻译：摘要：扩散模型的最新创新与突破为根据给定提示生成高质量视频显著拓展了可能性。现有工作大多针对单一场景，即仅在一个背景中发生一个视频事件。然而，扩展到生成多场景视频并非易事，需要妥善管理场景间的逻辑关系，同时保持关键内容在视频场景中的视觉外观一致性。本文提出了一种名为VideoDrafter的新框架，用于内容一致性的多场景视频生成。在技术上，VideoDrafter利用大语言模型（LLM）将输入提示转换为全面的多场景脚本，这得益于LLM所学习的逻辑知识。每个场景的脚本包括描述事件的提示、前景/背景实体以及镜头运动。VideoDrafter识别脚本中的共有实体，并要求LLM详细描述每个实体。由此产生的实体描述随后输入文本到图像模型，为每个实体生成参考图像。最后，VideoDrafter通过扩散过程生成每个场景视频，该过程综合参考图像、事件描述性提示和镜头运动，输出多场景视频。扩散模型将参考图像作为条件和对齐依据，以增强多场景视频的内容一致性。大量实验表明，VideoDrafter在视觉质量、内容一致性和用户偏好方面均优于现有的最先进视频生成模型。

相关内容

大语言模型

关注 66

大语言模型是基于海量文本数据训练的深度学习模型。它不仅能够生成自然语言文本，还能够深入理解文本含义，处理各种自然语言任务，如文本摘要、问答、翻译等。2023年，大语言模型及其在人工智能领域的应用已成为全球科技研究的热点，其在规模上的增长尤为引人注目，参数量已从最初的十几亿跃升到如今的一万亿。参数量的提升使得模型能够更加精细地捕捉人类语言微妙之处，更加深入地理解人类语言的复杂性。在过去的一年里，大语言模型在吸纳新知识、分解复杂任务以及图文对齐等多方面都有显著提升。随着技术的不断成熟，它将不断拓展其应用范围，为人类提供更加智能化和个性化的服务，进一步改善人们的生活和生产方式。

O’Reilly报告：知识图谱崛起——面向现代数据集成和数据结构体系，“The Rise of the Knowledge Graph——Toward Modern Data Integration and the Data Fabric Architecture”

专知会员服务

49+阅读 · 2022年2月18日

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日