Autoregressive transformers have shown remarkable success in video generation. However, they cannot directly learn long-term dependencies in videos because of the quadratic complexity of self-attention, and they inherently suffer from slow inference and error propagation due to the autoregressive process. In this paper, we propose the Memory-efficient Bidirectional Transformer (MeBT) for end-to-end learning of long-term dependencies in videos and for fast inference. Building on recent advances in bidirectional transformers, our method learns to decode the entire spatio-temporal volume of a video in parallel from partially observed patches. The proposed transformer achieves linear time complexity in both encoding and decoding by projecting the observable context tokens into a fixed number of latent tokens and conditioning on these latent tokens to decode the masked tokens through cross-attention. Empowered by linear complexity and bidirectional modeling, our method demonstrates significant improvement over autoregressive transformers in both quality and speed when generating moderately long videos. Videos and code are available at https://sites.google.com/view/mebt-cvpr2023.
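To illustrate why routing attention through a fixed number of latent tokens gives linear cost, the following is a minimal NumPy sketch, not the authors' implementation. It assumes single-head attention without learned projections, and all names (`cross_attention`, `latent_bottleneck_decode`, `attention_score_count`) and sizes are hypothetical choices made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Single-head scaled dot-product cross-attention.
    # Assumption: no learned query/key/value projections, to keep the sketch short.
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_queries, n_keys)
    return softmax(scores, axis=-1) @ keys_values   # (n_queries, d)

def latent_bottleneck_decode(context, latents, mask_queries):
    # Encoding: a fixed set of latent tokens reads from the observed context.
    # The score matrix has shape (num_latents, n_context).
    latents = latents + cross_attention(latents, context)
    # Decoding: every masked position reads from the latents in parallel.
    # The score matrix has shape (n_masked, num_latents).
    return cross_attention(mask_queries, latents)

def attention_score_count(n_context, n_masked, num_latents):
    # Number of attention scores computed by the two steps above.
    # With num_latents fixed, this grows linearly in the number of video tokens.
    return num_latents * n_context + n_masked * num_latents

rng = np.random.default_rng(0)
d, num_latents = 16, 8                        # hypothetical sizes
context = rng.normal(size=(100, d))           # observed (unmasked) patch tokens
latents = rng.normal(size=(num_latents, d))   # fixed number of latent tokens
mask_queries = rng.normal(size=(300, d))      # queries for masked positions

decoded = latent_bottleneck_decode(context, latents, mask_queries)
print(decoded.shape)                          # (300, 16)
print(attention_score_count(100, 300, num_latents), (100 + 300) ** 2)
```

In this toy setting the two cross-attention steps compute 3,200 attention scores, whereas full self-attention over the same 400 tokens would compute 160,000. Doubling the number of tokens doubles the first count but quadruples the second.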