I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. However, it still encounters challenges in terms of semantic accuracy, clarity and spatio-temporal continuity. They primarily arise from the scarcity of well-aligned text-video data and the complex inherent structure of videos, making it difficult for the model to simultaneously ensure semantic and qualitative excellence. In this report, we propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by utilizing static images as a form of crucial guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and improves the resolution to 1280$\times$720. To improve the diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos. Through extensive experiments, we have investigated the underlying principles of I2VGen-XL and compared it with current top methods, which can demonstrate its effectiveness on diverse data. The source code and models will be publicly available at \url{https://i2vgen-xl.github.io}.

翻译：视频合成技术近期得益于扩散模型的快速发展而取得了显著进展，但在语义准确性、清晰度及时空连续性方面仍面临挑战。这些挑战主要源于高质量文本-视频对齐数据的匮乏以及视频本身复杂的结构特性，导致模型难以同时保证语义与质量的双重优异表现。本报告中，我们提出级联式I2VGen-XL方法，通过解耦这两类影响因素，并利用静态图像作为关键引导来确保输入数据的对齐性。I2VGen-XL包含两个阶段：i) 基础阶段通过双层级编码器保证连贯语义并保留输入图像的内容；ii) 精炼阶段通过引入额外简短文本增强视频细节，并将分辨率提升至1280×720。为提升多样性，我们收集约3500万条单镜头文本-视频对和60亿条文本-图像对用于模型优化。通过这种方式，I2VGen-XL可同步提升生成视频的语义准确性、细节连续性与清晰度。通过大量实验，我们探究了I2VGen-XL的基本原理，并与当前顶尖方法进行了对比，验证了其在多样化数据上的有效性。源代码与模型将在\url{https://i2vgen-xl.github.io}公开。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日