We present VEnhancer, a generative space-time enhancement framework that improves existing text-to-video results by adding finer details in the spatial domain and synthesizing detailed motion in the temporal domain. Given a generated low-quality video, our approach can increase its spatial and temporal resolution simultaneously, with arbitrary up-sampling scales in space and time, through a unified video diffusion model. Furthermore, VEnhancer effectively removes spatial artifacts and temporal flickering in generated videos. To achieve this, building on a pretrained video diffusion model, we train a video ControlNet and inject it into the diffusion model as a condition on low frame-rate, low-resolution videos. To train this video ControlNet effectively, we design space-time data augmentation as well as video-aware conditioning. Benefiting from these designs, VEnhancer is stable during training and supports an elegant end-to-end training scheme. Extensive experiments show that VEnhancer surpasses existing state-of-the-art video super-resolution and space-time super-resolution methods in enhancing AI-generated videos. Moreover, with VEnhancer, the existing open-source state-of-the-art text-to-video method VideoCrafter-2 reaches first place on the video generation benchmark VBench.
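To make the space-time data augmentation concrete, the following is a minimal hypothetical sketch: a high-quality training clip is degraded by randomly sampled spatial and temporal down-scaling factors, so a conditioning model like the video ControlNet can learn to enhance at arbitrary up-sampling scales. The function name, scale ranges, and the nearest-neighbour strided degradation are illustrative assumptions, not the paper's exact pipeline.

```python
import random
import numpy as np


def space_time_downsample(video, s_range=(2, 4), t_range=(2, 4), seed=None):
    """Degrade a (T, H, W, C) clip by random space/time factors.

    Hypothetical sketch of space-time data augmentation: sample a
    spatial scale `s` and a temporal stride `t`, then downsample the
    clip by simple striding (nearest-neighbour in space, frame
    dropping in time). The degraded clip serves as the low-quality
    conditioning input; the original clip is the training target.
    """
    rng = random.Random(seed)
    s = rng.randint(*s_range)  # spatial down-scale factor
    t = rng.randint(*t_range)  # temporal down-scale factor (frame stride)
    return video[::t, ::s, ::s, :], s, t


# Usage: degrade a dummy 16-frame 64x64 RGB clip.
clip = np.zeros((16, 64, 64, 3), dtype=np.float32)
low_quality, s, t = space_time_downsample(clip, seed=0)
print(low_quality.shape, s, t)
```

Sampling fresh scales per clip exposes the model to many degradation levels, which is what allows a single network to handle arbitrary up-sampling ratios at inference time.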