Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives, suffering from disjointed visual dynamics and fractured storylines. Existing solutions either rely on extensive manual scripting and editing or prioritize single-shot fidelity over cross-scene continuity, limiting their practicality for movie-like content. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence by systematically addressing three core challenges: (1) Narrative Fragmentation: existing methods lack structured storytelling. We propose dynamic storyline modeling, which first converts the user prompt into concise shot descriptions and then elaborates them into detailed cinematic specifications across five domains (character dynamics, background continuity, relationship evolution, camera movements, and HDR lighting), ensuring logical narrative progression with self-validation. (2) Visual Inconsistency: existing approaches struggle to maintain visual consistency across shots. Our identity-aware cross-shot propagation generates identity-preserving portrait (IPP) tokens that maintain character fidelity while allowing trait variations (expressions, aging) dictated by the storyline. (3) Transition Artifacts: abrupt shot changes disrupt immersion. Our adjacent latent transition mechanism implements a boundary-aware reset strategy that processes adjacent shots' features at transition points, enabling seamless visual flow while preserving narrative continuity. VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency, while achieving over 100% better cross-shot consistency and requiring 10x fewer manual adjustments than alternatives.
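The three stages above can be sketched as a minimal pipeline. This is an illustrative stand-in, not the authors' implementation: every function name and data structure here is hypothetical, and the stubs replace what the paper describes as an LLM-driven storyline module and a diffusion-based video generator.

```python
# Hypothetical sketch of a VGoT-style three-stage pipeline.
# Stage bodies are stubs; the real system uses an LLM for storyline
# modeling and a video diffusion model for shot synthesis.

DOMAINS = [
    "character dynamics", "background continuity",
    "relationship evolution", "camera movements", "HDR lighting",
]

def dynamic_storyline(prompt: str, num_shots: int) -> list[dict]:
    """Stage 1 (stub): expand one sentence into concise shot briefs,
    then elaborate each brief into specs across the five domains."""
    shots = []
    for i in range(num_shots):
        brief = f"{prompt} -- shot {i + 1}"
        spec = {domain: f"{brief} | {domain}" for domain in DOMAINS}
        shots.append({"brief": brief, "spec": spec})
    return shots

def ipp_token(identity: str) -> dict:
    """Stage 2 (stub): an identity-preserving portrait (IPP) token --
    fixed identity plus traits the storyline may vary (expression, age)."""
    return {"identity": identity,
            "traits": {"expression": "neutral", "age": "adult"}}

def transition(prev_latent: list[float], next_latent: list[float],
               alpha: float = 0.5) -> list[float]:
    """Stage 3 (stub): boundary-aware blend of adjacent shots' latent
    features at the cut, standing in for the reset strategy."""
    return [alpha * a + (1 - alpha) * b
            for a, b in zip(prev_latent, next_latent)]
```

A caller would chain the stages: build the storyline, mint one IPP token per character, generate each shot conditioned on its spec and token, and blend adjacent latents at every cut.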