We introduce FSVideo, a fast transformer-based image-to-video (I2V) diffusion framework. Our framework rests on three key components: (1) a new video autoencoder with a highly compressed latent space ($64\times64\times4$ spatial-temporal downsampling ratio) that achieves competitive reconstruction quality; (2) a diffusion transformer (DIT) architecture with a new layer-memory design that enhances inter-layer information flow and context reuse within the DIT; and (3) a multi-resolution generation strategy that uses a few-step DIT upsampler to increase video fidelity. Our final model, comprising a 14B DIT base model and a 14B DIT upsampler, achieves competitive performance against other popular open-source models while being an order of magnitude faster. We describe our model design and training strategies in this report.
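To make the stated compression ratio concrete, the following minimal sketch computes the latent shape implied by the $64\times64\times4$ spatial-temporal downsampling. The latent channel count is not given in the abstract; the value of 16 used here is purely an illustrative assumption, as is the helper name `latent_shape`.

```python
def latent_shape(frames: int, height: int, width: int, channels: int = 16):
    """Return (t, h, w, c) of the compressed latent, assuming the stated
    64x spatial and 4x temporal downsampling ratios.
    NOTE: `channels=16` is a hypothetical value, not from the report."""
    assert frames % 4 == 0 and height % 64 == 0 and width % 64 == 0, \
        "input dims must be divisible by the downsampling factors"
    return (frames // 4, height // 64, width // 64, channels)

# Example: a 128-frame clip at 768x1280 resolution
print(latent_shape(128, 768, 1280))  # -> (32, 12, 20, 16)
```

Such aggressive compression shrinks the token count the DIT must attend over, which is one plausible source of the reported speedup.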