Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives, suffering from disjointed visual dynamics and fractured storylines. Existing solutions either rely on extensive manual scripting and editing or prioritize single-shot fidelity over cross-scene continuity, limiting their practicality for movie-like content. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence by systematically addressing three core challenges: (1) Narrative Fragmentation: existing methods lack structured storytelling. We propose dynamic storyline modeling, which first converts the user prompt into concise shot descriptions and then elaborates them into detailed cinematic specifications across five domains (character dynamics, background continuity, relationship evolution, camera movements, and HDR lighting), ensuring logical narrative progression with self-validation. (2) Visual Inconsistency: existing approaches struggle to maintain visual consistency across shots. Our identity-aware cross-shot propagation generates identity-preserving portrait (IPP) tokens that maintain character fidelity while allowing trait variations (expressions, aging) dictated by the storyline. (3) Transition Artifacts: abrupt shot changes disrupt immersion. Our adjacent latent transition mechanism implements a boundary-aware reset strategy that processes adjacent shots' features at transition points, enabling seamless visual flow while preserving narrative continuity. VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency, while achieving over 100% better cross-shot consistency and requiring 10x fewer manual adjustments than alternatives.
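The three stages above can be sketched as a minimal pipeline. This is an illustrative stand-in, not the authors' implementation: every function name and data structure here is hypothetical, and the stubs replace what the paper describes as an LLM-driven storyline module and a diffusion-based video generator.

```python
# Hypothetical sketch of a VGoT-style three-stage pipeline.
# Stage bodies are stubs; the real system uses an LLM for storyline
# modeling and a video diffusion model for shot synthesis.

DOMAINS = [
    "character dynamics", "background continuity",
    "relationship evolution", "camera movements", "HDR lighting",
]

def dynamic_storyline(prompt: str, num_shots: int) -> list[dict]:
    """Stage 1 (stub): expand one sentence into concise shot briefs,
    then elaborate each brief into specs across the five domains."""
    shots = []
    for i in range(num_shots):
        brief = f"{prompt} -- shot {i + 1}"
        spec = {domain: f"{brief} | {domain}" for domain in DOMAINS}
        shots.append({"brief": brief, "spec": spec})
    return shots

def ipp_token(identity: str) -> dict:
    """Stage 2 (stub): an identity-preserving portrait (IPP) token --
    fixed identity plus traits the storyline may vary (expression, age)."""
    return {"identity": identity,
            "traits": {"expression": "neutral", "age": "adult"}}

def transition(prev_latent: list[float], next_latent: list[float],
               alpha: float = 0.5) -> list[float]:
    """Stage 3 (stub): boundary-aware blend of adjacent shots' latent
    features at the cut, standing in for the reset strategy."""
    return [alpha * a + (1 - alpha) * b
            for a, b in zip(prev_latent, next_latent)]
```

A caller would chain the stages: build the storyline, mint one IPP token per character, generate each shot conditioned on its spec and token, and blend adjacent latents at every cut.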