This paper presents EasyAnimate, a transformer-based method for high-performance video generation. We extend the DiT framework, originally designed for 2D image synthesis, to 3D video generation by incorporating a motion module block that captures temporal dynamics, ensuring consistent frames and smooth motion transitions. The motion module can be adapted to various DiT baseline methods to generate videos in different styles, and it supports different frame rates and resolutions during both training and inference, handling both images and videos. Moreover, we introduce Slice VAE, a novel approach that compresses the temporal axis and thereby enables the generation of long-duration videos. Currently, EasyAnimate can generate videos with 144 frames. We provide a holistic DiT-based ecosystem for video production, covering data pre-processing, VAE training, DiT model training (both the baseline model and LoRA models), and end-to-end video inference. Code is available at: https://github.com/aigc-apps/EasyAnimate. We are continuously working to improve the performance of our method.
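To make the motion module idea concrete, the following is a minimal PyTorch sketch of a temporal attention block of the kind described above: attention runs along the frame axis so each spatial location can exchange information across time, with a residual connection that preserves the pretrained image prior. The class name `MotionModule`, the `(batch, frames, tokens, dim)` layout, and the single multi-head attention layer are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class MotionModule(nn.Module):
    """Hypothetical sketch of a temporal attention block for a DiT.

    Shapes and layer choices are assumptions for illustration; the
    paper's actual motion module may differ in detail.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) -- latent tokens per frame.
        b, f, n, d = x.shape
        # Fold spatial tokens into the batch so attention runs over frames.
        h = x.permute(0, 2, 1, 3).reshape(b * n, f, d)
        out, _ = self.attn(self.norm(h), self.norm(h), self.norm(h))
        h = h + out  # residual keeps the 2D image prior intact
        return h.reshape(b, n, f, d).permute(0, 2, 1, 3)
```

Because the block only mixes information along the frame axis and adds it residually, it can be inserted between the spatial blocks of different DiT baselines without retraining them from scratch, which is consistent with the adaptability claimed above.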
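Similarly, a hedged sketch of the temporal-slicing idea behind Slice VAE: the frame axis is split into fixed-length slices that are encoded independently and then concatenated in latent space, so long videos can be compressed in time without holding all frames in memory at once. The encoder layers, channel counts, and the `slice_len` parameter below are assumptions for illustration only, not the paper's architecture.

```python
import torch
import torch.nn as nn


class SliceVAEEncoder(nn.Module):
    """Illustrative sketch of temporal slicing for a video VAE.

    Two stride-2 3D convolutions compress the time axis 4x within
    each slice; all layer choices here are assumptions.
    """

    def __init__(self, in_ch: int = 3, latent_ch: int = 4, slice_len: int = 8):
        super().__init__()
        self.slice_len = slice_len
        self.encoder = nn.Sequential(
            nn.Conv3d(in_ch, 64, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(64, latent_ch, kernel_size=3, stride=(2, 2, 2), padding=1),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, channels, frames, height, width)
        latents = []
        for start in range(0, video.shape[2], self.slice_len):
            chunk = video[:, :, start:start + self.slice_len]
            latents.append(self.encoder(chunk))
        # Concatenate per-slice latents back along the time axis.
        return torch.cat(latents, dim=2)
```

Slicing bounds peak memory by the slice length rather than the full clip length, which is what makes the long-duration generation (e.g., the 144-frame videos mentioned above) tractable in latent space.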