Generating high-quality videos that synthesize desired realistic content is a challenging task due to the high dimensionality and complexity of videos. Several recent diffusion-based methods have shown comparable performance by compressing videos to a lower-dimensional latent space using a traditional video autoencoder architecture. However, such methods, which employ standard frame-wise 2D and 3D convolutions, fail to fully exploit the spatio-temporal nature of videos. To address this issue, we propose a novel hybrid video diffusion model, called HVDM, which captures spatio-temporal dependencies more effectively. HVDM is trained with a hybrid video autoencoder that extracts a disentangled representation of the video, including: (i) global context information captured by a 2D projected latent, (ii) local volume information captured by 3D convolutions with wavelet decomposition, and (iii) frequency information for improving video reconstruction. Based on this disentangled representation, our hybrid autoencoder provides a more comprehensive video latent, enriching the generated videos with fine structures and details. Experiments on video generation benchmarks (UCF101, SkyTimelapse, and TaiChi) demonstrate that the proposed approach achieves state-of-the-art video generation quality and supports a wide range of video applications (e.g., long video generation, image-to-video, and video dynamics control).
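To make two of the ingredients named in the abstract more concrete, the following is a minimal NumPy sketch of (a) a single-level Haar wavelet decomposition of a video volume along the temporal axis and (b) a set of 2D projections of that volume. It is an illustration under stated assumptions, not the HVDM implementation: the abstract does not specify the wavelet family, the number of decomposition levels, or how the 2D latent is obtained. The Haar filter, the axis-wise averaging in `project_2d`, and all function names here are hypothetical choices made only for this example.

```python
import numpy as np


def haar_split(x, axis):
    """Single-level orthonormal Haar transform along one axis.

    Assumes the length along `axis` is even. Returns (low, high) sub-bands,
    each with half the original length along that axis.
    """
    x = np.moveaxis(x, axis, 0)
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)   # coarse, slowly varying content
    high = (even - odd) / np.sqrt(2.0)  # detail, rapidly varying content
    return np.moveaxis(low, 0, axis), np.moveaxis(high, 0, axis)


def haar_merge(low, high, axis):
    """Inverse of haar_split: interleave the reconstructed samples."""
    low = np.moveaxis(low, axis, 0)
    high = np.moveaxis(high, axis, 0)
    even = (low + high) / np.sqrt(2.0)
    odd = (low - high) / np.sqrt(2.0)
    out = np.empty((2 * low.shape[0],) + low.shape[1:], dtype=low.dtype)
    out[0::2], out[1::2] = even, odd
    return np.moveaxis(out, 0, axis)


def project_2d(video):
    """Hypothetical stand-in for a 2D projected latent.

    Averages a (T, H, W) volume along each axis in turn, giving three
    2D planes. A learned projection would replace this in a real model.
    """
    return video.mean(axis=0), video.mean(axis=1), video.mean(axis=2)


rng = np.random.default_rng(0)
video = rng.standard_normal((8, 16, 16))  # toy grayscale clip, shape (T, H, W)

low_t, high_t = haar_split(video, axis=0)   # temporal low/high frequency bands
recon = haar_merge(low_t, high_t, axis=0)   # lossless reconstruction
planes = project_2d(video)                  # three global 2D summaries
```

In this toy setting the low band carries the slowly varying temporal content and the high band carries frame-to-frame changes, while the three planes summarize the whole clip at the cost of local detail. This gives a rough intuition for why a representation might combine a global 2D view with local, frequency-aware 3D features.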