To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such a paradigm is computationally expensive. In this work, we propose a new T2V generation setting, One-Shot Video Tuning, in which only one text-video pair is presented. Our model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that represent verb terms; 2) extending T2I models to generate multiple images concurrently yields surprisingly good content consistency. To further learn continuous motion, we introduce Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy. At inference, we employ DDIM inversion to provide structure guidance for sampling. Extensive qualitative and numerical experiments demonstrate the remarkable ability of our method across various applications.
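To make the inference step more concrete, the following is a minimal NumPy sketch of deterministic DDIM inversion followed by DDIM sampling. It is an illustration under stated assumptions, not the authors' implementation: the names `ddim_step`, `ddim_invert`, `ddim_sample`, and `eps_model` are hypothetical, the noise predictor is a stand-in for the tuned denoising network, and conditioning it on the cumulative noise level rather than on a timestep index and a text prompt is a simplification.

```python
import numpy as np


def ddim_step(x, eps, alpha_from, alpha_to):
    """One deterministic DDIM update (eta = 0) between two noise levels.

    alpha_from and alpha_to are cumulative alpha-bar values in (0, 1].
    """
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = (x - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    # Re-noise that estimate to the target noise level.
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1.0 - alpha_to) * eps


def ddim_invert(x0, eps_model, alphas):
    """Map clean latents to a noisy latent.

    alphas is ordered from the cleanest level (near 1) to the noisiest (near 0).
    eps_model(x, alpha) is a hypothetical noise predictor (assumption).
    """
    x = x0
    for a_from, a_to in zip(alphas[:-1], alphas[1:]):
        x = ddim_step(x, eps_model(x, a_from), a_from, a_to)
    return x


def ddim_sample(x_noisy, eps_model, alphas):
    """Run the same update in the opposite direction, from noisy to clean."""
    x = x_noisy
    reversed_alphas = alphas[::-1]
    for a_from, a_to in zip(reversed_alphas[:-1], reversed_alphas[1:]):
        x = ddim_step(x, eps_model(x, a_from), a_from, a_to)
    return x
```

In this sketch, the latent returned by `ddim_invert` for the source video would be used as the starting point for `ddim_sample` with an edited prompt, so the sampled frames inherit the structure of the source. The round trip is exact only when the noise prediction does not change along the trajectory; with a real network it is an approximation.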