Diffusion Transformers (DiTs) have recently driven significant progress in text-to-video (T2V) generation. However, generating multiple videos with consistent characters and backgrounds remains challenging. Existing methods typically rely on reference images or extensive training, and often address only character consistency, leaving background consistency to image-to-video models. We introduce BachVid, the first training-free method that achieves consistent video generation without needing any reference images. Our approach is based on a systematic analysis of DiT's attention mechanism and intermediate features, revealing its ability to extract foreground masks and identify matching points during the denoising process. Our method leverages this finding by first generating an identity video and caching its intermediate variables, and then injecting these cached variables into the corresponding positions of newly generated videos, ensuring both foreground and background consistency across multiple videos. Experimental results demonstrate that BachVid achieves robust consistency in generated videos, offering a novel and efficient solution for consistent video generation without relying on reference images or additional training.
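To make the cache-and-inject idea concrete, the following is a minimal, self-contained PyTorch sketch, not the authors' implementation: it omits the foreground-mask extraction and point matching described above and simply replays cached attention features from a first (identity) generation pass when producing a second video. All names (ToyDiTBlock, generate) and the toy denoising loop are hypothetical stand-ins for a real DiT pipeline.

```python
# Minimal sketch of the cache-then-inject pattern described in the abstract:
# run a toy "denoising" loop once while caching intermediate attention
# features, then replay them at the corresponding steps for a new generation.
import torch
import torch.nn as nn


class ToyDiTBlock(nn.Module):
    """Stand-in for one DiT block; caches or reuses its attention output."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.cache: list[torch.Tensor] = []  # one cached feature per step
        self.inject = False                  # replay cached features if True
        self.step = 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.inject and self.step < len(self.cache):
            attn_out = self.cache[self.step]  # reuse identity-video features
        else:
            attn_out, _ = self.attn(x, x, x)
            self.cache.append(attn_out.detach())
        self.step += 1
        return x + attn_out


def generate(block: ToyDiTBlock, steps: int, tokens: int, dim: int) -> torch.Tensor:
    """Toy 'denoising' loop: repeatedly refine a random latent."""
    block.step = 0
    x = torch.randn(1, tokens, dim)
    for _ in range(steps):
        x = block(x)
    return x


if __name__ == "__main__":
    block = ToyDiTBlock(dim=64)
    identity_video = generate(block, steps=10, tokens=128, dim=64)  # pass 1: cache
    block.inject = True
    new_video = generate(block, steps=10, tokens=128, dim=64)       # pass 2: inject
    print(identity_video.shape, new_video.shape)
```

In the actual method, the injection is presumably restricted to matched foreground/background positions rather than the whole token sequence; this sketch only illustrates the training-free, reference-free cache-and-replay mechanism at the attention level.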