We propose a novel inference technique based on a pretrained diffusion model for text-conditional video generation. Our approach, called FIFO-Diffusion, is conceptually capable of generating infinitely long videos without additional training. This is achieved by iteratively performing diagonal denoising, which concurrently processes a series of consecutive frames with increasing noise levels in a queue; our method dequeues a fully denoised frame at the head while enqueuing a new random-noise frame at the tail. However, diagonal denoising is a double-edged sword: frames near the tail can exploit cleaner frames ahead of them by forward reference, but this strategy induces a discrepancy between training and inference. We therefore introduce latent partitioning to reduce the training-inference gap and lookahead denoising to retain the benefit of forward referencing. In practice, FIFO-Diffusion consumes a constant amount of memory regardless of the target video length for a given baseline model, and is well suited to parallel inference on multiple GPUs. We demonstrate the promising results and effectiveness of the proposed methods on existing text-to-video generation baselines. Generated video samples and source code are available on our project page.
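To make the queue mechanics concrete, below is a minimal PyTorch sketch of the diagonal denoising loop described above. It is an illustration under stated assumptions, not the released implementation: `denoise_step`, `num_steps`, and `latent_shape` are hypothetical names, and `denoise_step` stands in for a wrapper around a pretrained video diffusion model that advances each latent frame by one step at its own per-frame timestep.

```python
import torch


@torch.no_grad()
def fifo_diffusion(denoise_step, num_frames, num_steps, latent_shape, device="cpu"):
    """Minimal sketch of FIFO-Diffusion's diagonal denoising loop (illustrative).

    `denoise_step(latents, timesteps)` is a hypothetical wrapper around a
    pretrained video diffusion model: given a window of `num_steps` latent
    frames and their per-frame timesteps, it returns each frame advanced by
    one denoising step.
    """
    # Per-frame timesteps, from most noisy (tail, index 0) to nearly clean
    # (head, index -1) -- the "diagonal" of noise levels.
    timesteps = torch.linspace(num_steps - 1, 0, num_steps, device=device).long()

    # Initialize the queue with frames at the corresponding noise levels.
    queue = torch.randn(num_steps, *latent_shape, device=device)

    outputs = []
    for _ in range(num_frames):
        # Concurrently denoise the whole queue; each frame sits at a
        # different noise level.
        queue = denoise_step(queue, timesteps)

        # Dequeue the fully denoised frame at the head ...
        outputs.append(queue[-1])

        # ... and enqueue fresh Gaussian noise at the tail; the remaining
        # frames shift one noise level toward the head.
        new_noise = torch.randn(1, *latent_shape, device=device)
        queue = torch.cat([new_noise, queue[:-1]], dim=0)

    return torch.stack(outputs)  # (num_frames, *latent_shape)


# Toy usage with a dummy "model" that merely shrinks latents toward zero:
dummy_step = lambda latents, timesteps: latents * 0.9
frames = fifo_diffusion(dummy_step, num_frames=16, num_steps=8, latent_shape=(4, 8, 8))
```

Note how the queue length stays fixed at `num_steps` no matter how many frames are emitted, which is why memory usage is constant in the target video length; only the model wrapper and the schedule depend on the chosen baseline.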