Video synthesis has recently made remarkable strides, benefiting from the rapid development of diffusion models. However, it still faces challenges in semantic accuracy, clarity, and spatio-temporal continuity. These challenges arise primarily from the scarcity of well-aligned text-video data and the complex inherent structure of videos, which make it difficult for a model to ensure semantic and qualitative excellence simultaneously. In this report, we propose I2VGen-XL, a cascaded approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by using static images as a crucial form of guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from the input images by using two hierarchical encoders, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and improves the resolution to 1280$\times$720. To improve diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. In this way, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details, and clarity of generated videos. Through extensive experiments, we investigate the underlying principles of I2VGen-XL and compare it with current top methods, demonstrating its effectiveness on diverse data. The source code and models will be publicly available at \url{https://i2vgen-xl.github.io}.