We present a general and simple text-to-video model based on the Transformer. Since both text and video are sequential data, we encode text and images into the same hidden space; the encodings are fed into a Transformer to capture temporal consistency and then into a decoder that generates either text or images. Because the image signal may become weak in a long sequence, we introduce a U-Net to reconstruct each image from its noised version. Specifically, we increase the noise level applied to the original images in the long sequence, then use the $down$ module of the U-Net to encode the noised images, which are in turn input to the Transformer to predict the next clean image. We also add a constraint that promotes motion between any pair of generated images in the video. We build on GPT2, evaluate our approach on the UCF101 dataset, and show that it can generate promising videos.
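The two training-time ingredients described above, noise that grows over a long sequence and a constraint that promotes motion between generated frame pairs, can be illustrated with a minimal sketch. The abstract does not give the noise schedule or the form of the constraint, so everything here is an assumption made for illustration: the linear schedule, the hinge-style penalty, and the names `noise_levels`, `add_noise`, `motion_penalty`, `sigma_max`, and `margin` are all hypothetical. Frames are flat lists of pixel values to keep the example dependency-free.

```python
import random


def noise_levels(num_frames, sigma_min=0.0, sigma_max=1.0):
    """Noise standard deviation per frame index.

    Assumption: a linear ramp from sigma_min to sigma_max. The paper only
    says the noise level is increased in the long sequence.
    """
    if num_frames == 1:
        return [sigma_min]
    step = (sigma_max - sigma_min) / (num_frames - 1)
    return [sigma_min + i * step for i in range(num_frames)]


def add_noise(frame, sigma, rng):
    """Return a copy of `frame` with zero-mean Gaussian noise of std `sigma`."""
    return [pixel + rng.gauss(0.0, sigma) for pixel in frame]


def motion_penalty(frames, margin=0.1):
    """Hypothetical motion constraint over all frame pairs.

    Each pair whose mean absolute pixel difference falls below `margin`
    contributes a hinge penalty, so near-identical frames are discouraged.
    Returns the average penalty over pairs (0.0 if there are no pairs).
    """
    total, pairs = 0.0, 0
    for i in range(len(frames)):
        for j in range(i + 1, len(frames)):
            diff = sum(abs(a - b) for a, b in zip(frames[i], frames[j]))
            diff /= len(frames[i])
            total += max(0.0, margin - diff)
            pairs += 1
    return total / pairs if pairs else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [[0.5] * 4 for _ in range(5)]  # five identical toy "frames"
    sigmas = noise_levels(len(clean))
    noised = [add_noise(f, s, rng) for f, s in zip(clean, sigmas)]
    print("noise levels:", sigmas)
    print("penalty on static clean frames:", motion_penalty(clean))
    print("number of noised frames:", len(noised))
```

In a full model, the noised frames would be encoded by the U-Net $down$ module and passed to the Transformer, and a penalty of this kind would be added to the training loss; neither step is shown here.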