Multi-Shot Character Consistency for Text-to-Video Generation

Text-to-video models have made significant strides in generating short video clips from textual descriptions. Yet a major challenge remains: generating several video shots of the same characters while preserving their identity without hurting video quality, dynamics, and responsiveness to text prompts. We present Video Storyboarding, a training-free method that enables pretrained text-to-video models to generate multiple shots with consistent characters by sharing features between the shots. Our key insight is that self-attention query features (Q) encode both motion and identity. When features are shared, this creates a hard-to-avoid trade-off between preserving character identity and keeping videos dynamic. To address this issue, we introduce a novel query injection strategy that balances identity preservation and natural motion retention. This approach improves upon naive consistency techniques applied to videos, which often struggle to maintain this delicate equilibrium. Our experiments demonstrate significant improvements in character consistency across scenes while maintaining high-quality motion and text alignment. These results offer insights into critical stages of video generation and the interplay of structure and motion in video diffusion models.
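
The feature-sharing idea described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`shared_attention`, `blend_queries`), the tensor layout `(shots, tokens, dim)`, and the blending weight `alpha` are all assumptions made for illustration. The sketch shows one plausible form of sharing, in which every shot attends to the keys and values of all shots, plus a simple linear blend between two sets of query features that stands in for a query injection step.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def shared_attention(q, k, v):
    """Self-attention in which each shot attends to all shots (hypothetical).

    q, k, v: arrays of shape (shots, tokens, dim).
    Returns an array of shape (shots, tokens, dim).
    """
    shots, tokens, dim = k.shape
    # Concatenate keys and values of all shots and give every shot a copy.
    k_all = k.reshape(1, shots * tokens, dim).repeat(shots, axis=0)
    v_all = v.reshape(1, shots * tokens, dim).repeat(shots, axis=0)
    scores = q @ k_all.transpose(0, 2, 1) / np.sqrt(dim)
    return softmax(scores, axis=-1) @ v_all


def blend_queries(q_a, q_b, alpha):
    """Linear blend of two query sets (an assumed stand-in for query injection).

    alpha = 0 returns q_a unchanged; alpha = 1 returns q_b.
    """
    return (1.0 - alpha) * q_a + alpha * q_b


rng = np.random.default_rng(0)
q_a = rng.normal(size=(3, 4, 8))  # e.g. queries from one generation pass
q_b = rng.normal(size=(3, 4, 8))  # e.g. queries from another pass
k = rng.normal(size=(3, 4, 8))
v = rng.normal(size=(3, 4, 8))

out = shared_attention(blend_queries(q_a, q_b, alpha=0.5), k, v)
print(out.shape)
```

Because both identity and motion are said to live in the query features, the choice of `alpha` in such a blend is where the trade-off would show up: which queries are blended, and at which denoising steps, is left open here and is not specified by the abstract.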
