CV-VAE: A Compatible Video VAE for Latent Generative Video Models

Spatio-temporal compression of videos, using networks such as Variational Autoencoders (VAEs), plays a crucial role in OpenAI's SORA and numerous other video generative models. For instance, many LLM-like video models learn the distribution of discrete tokens derived from 3D VAEs within the VQVAE framework, while most diffusion-based video models capture the distribution of continuous latents extracted by 2D VAEs without quantization. In the latter case, temporal compression is realized simply by uniform frame sampling, which results in unsmooth motion between consecutive frames. Currently, the research community lacks a commonly used continuous video (3D) VAE for latent diffusion-based video models. Moreover, since current diffusion-based approaches are often built on pre-trained text-to-image (T2I) models, directly training a video VAE without considering its compatibility with existing T2I models results in a latent space gap between them, and bridging that gap requires huge computational resources even when the T2I models are used as initialization. To address this issue, we propose a method for training a video VAE for latent video models, namely CV-VAE, whose latent space is compatible with that of a given image VAE, e.g., the image VAE of Stable Diffusion (SD). The compatibility is achieved by the proposed novel latent space regularization, which formulates a regularization loss using the image VAE. Benefiting from this latent space compatibility, video models can be trained seamlessly from pre-trained T2I or video models in a truly spatio-temporally compressed latent space, rather than by simply sampling video frames at equal intervals. With our CV-VAE, existing video models can generate four times more frames with minimal finetuning. Extensive experiments are conducted to demonstrate the effectiveness of the proposed video VAE.
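
To make the idea of a regularization loss built from the image VAE more concrete, here is a minimal toy sketch in plain Python. It is not the method as implemented for CV-VAE: the abstract does not specify the loss, so everything below is an assumption made for illustration. The "image VAE" is a fixed pair of scalings (`image_encode`, `image_decode`), the "video encoder" averages groups of `FACTOR = 4` frames and applies a single hypothetical parameter `scale`, and the loss decodes the video latents with the frozen image decoder and compares the result against the temporally averaged input frames.

```python
from statistics import fmean

FACTOR = 4  # assumed temporal compression ratio (hypothetical, for illustration)


def image_encode(frame):
    """Toy frozen image-VAE encoder: a fixed scaling of pixel values."""
    return [0.5 * x for x in frame]


def image_decode(latent):
    """Toy frozen image-VAE decoder: the inverse of image_encode."""
    return [2.0 * z for z in latent]


def group_mean(frames, factor=FACTOR):
    """Average each run of `factor` consecutive frames, pixel by pixel."""
    groups = [frames[i:i + factor] for i in range(0, len(frames), factor)]
    return [[fmean(pixels) for pixels in zip(*group)] for group in groups]


def video_encode(frames, scale):
    """Toy video encoder: temporal averaging followed by a scaling.

    `scale` stands in for the trainable weights of a real 3D VAE encoder.
    """
    return [[scale * x for x in frame] for frame in group_mean(frames)]


def regularization_loss(frames, scale):
    """MSE between image-decoded video latents and the averaged input frames.

    The loss is small only when the video latents lie where the frozen
    image decoder expects them, i.e. when the latent spaces are compatible.
    """
    decoded = [image_decode(z) for z in video_encode(frames, scale)]
    targets = group_mean(frames)
    errors = [
        (d - t) ** 2
        for decoded_frame, target_frame in zip(decoded, targets)
        for d, t in zip(decoded_frame, target_frame)
    ]
    return fmean(errors)


# Eight toy frames of two "pixels" each.
video = [[float(t), float(t) + 1.0] for t in range(8)]

print(len(video_encode(video, scale=0.5)))    # number of latent frames
print(regularization_loss(video, scale=0.5))  # encoder matched to the image VAE
print(regularization_loss(video, scale=1.0))  # encoder with a mismatched latent scale
```

In this toy setting, the loss vanishes only when the video encoder's latents coincide with what `image_encode` would produce for the averaged frames, which is the sense of "compatible" the sketch is meant to convey. A real video VAE would replace the scalings with learned networks and would combine such a term with the usual reconstruction and KL objectives.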
