Current video generation models excel at producing short clips but still struggle to create multi-shot, movie-like videos. Existing models, trained on large-scale data with abundant computational resources, are unsurprisingly inadequate at maintaining a logical storyline and visual consistency across the multiple shots of a cohesive script, since they are typically trained with a single-shot objective. To this end, we propose VideoGen-of-Thought (VGoT), a collaborative, training-free architecture designed specifically for multi-shot video generation. VGoT is designed with three goals in mind. Multi-Shot Video Generation: we divide the video generation process into a structured, modular sequence: (1) Script Generation, which translates a brief story into detailed prompts for each shot; (2) Keyframe Generation, which creates visually consistent keyframes faithful to character portrayals; (3) Shot-Level Video Generation, which transforms script and keyframe information into shots; and (4) a Smoothing Mechanism that ensures a consistent multi-shot output. Reasonable Narrative Design: inspired by cinematic scriptwriting, our prompt-generation approach spans five key domains, ensuring logical consistency, character development, and narrative flow across the entire video. Cross-Shot Consistency: we enforce temporal and identity consistency by leveraging identity-preserving (IP) embeddings across shots, which are automatically derived from the narrative. Additionally, we incorporate a cross-shot smoothing mechanism that integrates a reset boundary to blend latent features from adjacent shots, yielding smooth transitions and maintaining visual coherence throughout the video. Our experiments demonstrate that VGoT surpasses existing video generation methods in producing high-quality, coherent, multi-shot videos.
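The four-stage pipeline described above can be sketched as follows. This is a minimal illustrative skeleton, not the authors' implementation: all function names are hypothetical, each stage's model internals are stubbed out, and the smoothing step is reduced to simple concatenation of shot frames.

```python
# Hypothetical sketch of the VGoT four-stage pipeline; stage internals
# are stubbed with placeholder data rather than real generative models.

def generate_scripts(story, num_shots):
    # Stage 1: expand a brief story into one detailed prompt per shot.
    return [f"{story} (shot {i + 1})" for i in range(num_shots)]

def generate_keyframe(prompt, ip_embedding):
    # Stage 2: create a keyframe conditioned on an identity-preserving
    # (IP) embedding so characters look consistent across shots.
    return {"prompt": prompt, "identity": ip_embedding}

def generate_shot(prompt, keyframe, frames_per_shot=4):
    # Stage 3: synthesize a short clip from the script prompt and keyframe.
    return [f"{keyframe['identity']}:{prompt}:frame{t}"
            for t in range(frames_per_shot)]

def smooth(shots):
    # Stage 4: cross-shot smoothing; here simply concatenated, whereas the
    # real mechanism blends latent features across a reset boundary.
    video = []
    for shot in shots:
        video.extend(shot)
    return video

def vgot(story, num_shots, ip_embedding="protagonist"):
    prompts = generate_scripts(story, num_shots)
    keyframes = [generate_keyframe(p, ip_embedding) for p in prompts]
    shots = [generate_shot(p, k) for p, k in zip(prompts, keyframes)]
    return smooth(shots)

video = vgot("A dog finds a hat", num_shots=3)
```

The modular decomposition lets each stage be swapped independently, which is what makes the overall architecture training-free: off-the-shelf models can fill each role.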