Text-image-to-video (TI2V) generation is a critical problem for controllable video generation using both semantic and visual conditions. Most existing methods add visual conditions to text-to-video (T2V) foundation models by finetuning, which is resource-intensive and limited to a few predefined conditioning settings. To tackle these constraints, we introduce a unified formulation for TI2V generation with flexible visual conditioning. Furthermore, we propose an innovative training-free approach, dubbed FlexTI2V, that can condition T2V foundation models on an arbitrary number of images at arbitrary positions. Specifically, we first invert the condition images to noisy representations in the latent space. Then, in the denoising process of T2V models, our method uses a novel random patch swapping strategy to incorporate visual features into the video representation through local image patches. To balance creativity and fidelity, we use a dynamic control mechanism to adjust the strength of visual conditioning for each video frame. Extensive experiments validate that our method surpasses previous training-free image conditioning methods by a notable margin. Our method also generalizes to both UNet-based and transformer-based architectures.
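To make the patch swapping idea concrete, below is a minimal sketch of how random latent patches from a noised condition image could replace patches in a frame latent during denoising. All names (`swap_random_patches`, `swap_schedule`, `swap_ratio`, `base`, `decay`) and the exact decay schedule are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def swap_random_patches(frame_latent, cond_latent, swap_ratio, patch=2):
    """Hypothetical sketch: replace a random fraction of non-overlapping
    latent patches in a video-frame latent with the spatially matching
    patches from a noised condition-image latent.
    Shapes: (C, H, W); H and W are assumed divisible by `patch`."""
    c, h, w = frame_latent.shape
    gh, gw = h // patch, w // patch                # patch-grid dimensions
    n_swap = int(swap_ratio * gh * gw)             # number of patches to swap
    idx = torch.randperm(gh * gw)[:n_swap]         # random patch positions
    out = frame_latent.clone()
    for i in idx.tolist():
        r, col = (i // gw) * patch, (i % gw) * patch
        out[:, r:r + patch, col:col + patch] = \
            cond_latent[:, r:r + patch, col:col + patch]
    return out

def swap_schedule(frame_idx, pos, t, T, base=0.5, decay=0.1):
    """Assumed dynamic-control schedule: conditioning strength decays with
    temporal distance from the conditioned frame `pos` and as denoising
    progresses (timestep t runs from T down to 0)."""
    return base * (t / T) * max(0.0, 1.0 - decay * abs(frame_idx - pos))
```

In this sketch, frames closer to the condition image's position receive a larger swap ratio (higher fidelity), while distant frames keep more of the model's own prediction (more creativity); the ratio also shrinks at later denoising steps so that conditioning dominates early structure formation rather than fine detail.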