We propose Latent-Shift, an efficient text-to-video (T2V) generation method built on a pretrained text-to-image generation model that consists of an autoencoder and a U-Net diffusion model. Learning a video diffusion model in the latent space is much more efficient than in the pixel space. Pixel-space approaches are often limited to first generating a low-resolution video and then applying a sequence of frame interpolation and super-resolution models, which makes the entire pipeline complex and computationally expensive. To extend a U-Net from image generation to video generation, prior work adds modules such as 1D temporal convolution and/or temporal attention layers. In contrast, we propose a parameter-free temporal shift module that can leverage the spatial U-Net as is for video generation. We achieve this by shifting two portions of the feature map channels along the temporal dimension, one forward and one backward. The shifted features of the current frame thus receive features from the previous and the subsequent frames, enabling motion learning without additional parameters. We show that Latent-Shift achieves comparable or better results while being significantly more efficient. Moreover, Latent-Shift can still generate images despite being finetuned for T2V generation.
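To make the shift operation concrete, the following is a minimal sketch in plain Python, not the implementation used in this work. It assumes the features of a clip are stored as nested lists indexed as [frame][channel][row][column]. The function name `temporal_shift`, the `shift_fraction` parameter (the share of channels moved in each direction), and the zero-padding at the first and last frames are illustrative assumptions; the abstract does not specify them.

```python
from typing import List

Channel = List[List[float]]  # one H x W feature map
Frame = List[Channel]        # C channels of a single frame
Clip = List[Frame]           # T frames


def zeros_like(channel: Channel) -> Channel:
    """Return an all-zero map with the same H x W shape as `channel`."""
    return [[0.0 for _ in row] for row in channel]


def temporal_shift(clip: Clip, shift_fraction: float = 0.25) -> Clip:
    """Shift two channel groups along time; uses no learnable parameters.

    `shift_fraction` is a hypothetical knob: the fraction of channels moved
    in each direction. Out-of-range frames are zero-padded (an assumption).
    """
    num_frames = len(clip)
    num_channels = len(clip[0])
    fold = int(num_channels * shift_fraction)
    if 2 * fold > num_channels:
        raise ValueError("shift_fraction must be at most 0.5")

    shifted: Clip = []
    for t in range(num_frames):
        frame: Frame = []
        for c in range(num_channels):
            if c < fold:
                src = t - 1  # forward shift: frame t receives from frame t - 1
            elif c < 2 * fold:
                src = t + 1  # backward shift: frame t receives from frame t + 1
            else:
                src = t      # remaining channels stay with their own frame
            if 0 <= src < num_frames:
                frame.append([row[:] for row in clip[src][c]])
            else:
                frame.append(zeros_like(clip[t][c]))
        shifted.append(frame)
    return shifted


# Toy clip: 3 frames, 4 channels, 1 x 1 maps; the value encodes (frame, channel).
clip = [[[[10.0 * (t + 1) + c]] for c in range(4)] for t in range(3)]
shifted = temporal_shift(clip, shift_fraction=0.25)
middle_frame = [channel[0][0] for channel in shifted[1]]
```

In this toy clip, channel 0 of the middle frame comes from the previous frame and channel 1 from the subsequent frame, while channels 2 and 3 are unchanged. Any spatial layer applied to the shifted features afterwards therefore mixes information from three adjacent frames without adding parameters.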