Generating videos for visual storytelling can be a tedious and complex process that typically requires either live-action filming or rendering of graphics animation. To bypass these challenges, our key idea is to exploit the abundance of existing video clips and synthesize a coherent storytelling video by customizing their appearance. We achieve this with a framework consisting of two functional modules: (i) Motion Structure Retrieval, which provides video candidates with the desired scene or motion context described by query texts, and (ii) Structure-Guided Text-to-Video Synthesis, which generates plot-aligned videos under the guidance of motion structure and text prompts. For the first module, we leverage an off-the-shelf video retrieval system and extract video depth as the motion structure. For the second module, we propose a controllable video generation model that offers flexible control over structure and characters, so that the synthesized videos follow both the structural guidance and the appearance instructions. To ensure visual consistency across clips, we propose an effective concept personalization approach, which allows the desired character identities to be specified through text prompts. Extensive experiments demonstrate that our approach exhibits significant advantages over various existing baselines.
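The two-module data flow described above can be sketched as follows. This is only an illustrative outline under stated assumptions, not the authors' implementation: every name here (`Clip`, `PlotStep`, `retrieve`, `extract_structure`, `synthesize`, `make_story`) is hypothetical, word overlap stands in for the off-the-shelf retrieval system, intensity normalization stands in for depth estimation, and the synthesis step merely records its conditioning inputs instead of running a generative model.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    clip_id: str
    caption: str
    frames: list  # each frame: 2D list of grayscale values in [0, 255]


@dataclass
class PlotStep:
    query: str   # text describing the desired scene or motion context
    prompt: str  # text describing the desired appearance


def retrieve(query, library):
    """Module (i), toy stand-in: pick the clip whose caption shares
    the most words with the query."""
    words = set(query.lower().split())
    return max(library, key=lambda c: len(words & set(c.caption.lower().split())))


def extract_structure(clip):
    """Placeholder for depth extraction: rescales pixel values to [0, 1].
    A real system would run a depth estimator on each frame."""
    return [[[px / 255.0 for px in row] for row in frame] for frame in clip.frames]


def synthesize(structure, prompt, concept_tokens):
    """Module (ii), placeholder: records what a structure-guided generator
    would be conditioned on. Concept tokens (assumed syntax, e.g. "<teddy>")
    stand in for personalized character identities named in the prompt."""
    used = sorted(t for t in concept_tokens if t in prompt)
    return {"num_frames": len(structure), "prompt": prompt, "concepts": used}


def make_story(plot, library, concept_tokens):
    """Run retrieval, structure extraction, and synthesis for each plot step."""
    story = []
    for step in plot:
        clip = retrieve(step.query, library)
        result = synthesize(extract_structure(clip), step.prompt, concept_tokens)
        result["source_clip"] = clip.clip_id
        story.append(result)
    return story
```

Reusing the same concept tokens across all plot steps is what would keep a character's identity consistent from clip to clip in this sketch; the retrieved clip contributes only its structure, never its appearance.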