Latent variable generative models have emerged as powerful tools for generative tasks including image and video synthesis. These models are enabled by pretrained autoencoders that map high-resolution data into a compressed, lower-dimensional latent space, where the generative models can subsequently be developed while requiring fewer computational resources. Despite their effectiveness, the direct application of latent variable models to higher-dimensional domains such as videos continues to pose challenges for efficient training and inference. In this paper, we propose an autoencoder that projects volumetric data onto a four-plane factorized latent space that grows sublinearly with the input size, making it well suited to higher-dimensional data like videos. The design of our factorized model supports straightforward adoption in a number of conditional generation tasks with latent diffusion models (LDMs), such as class-conditional generation, frame prediction, and video interpolation. Our results show that the proposed four-plane latent space retains the rich representation needed for high-fidelity reconstructions despite the heavy compression, while simultaneously enabling LDMs to operate with significant improvements in speed and memory.
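To make the sublinear-growth claim concrete, the following is a minimal counting sketch, not the paper's implementation. It assumes a hypothetical layout in which a latent volume of size T x H x W is replaced by two spatial (H x W) planes and two spatiotemporal planes (T x W and T x H). The abstract does not specify the plane layout, so the function names and the layout are illustrative assumptions only.

```python
# Illustrative arithmetic only: the plane layout below is an assumption,
# not taken from the abstract.

def volume_size(t: int, h: int, w: int) -> int:
    """Number of latent positions in a full T x H x W latent volume."""
    return t * h * w


def four_plane_size(t: int, h: int, w: int) -> int:
    """Latent positions in an assumed four-plane layout:
    two H x W planes, one T x W plane, and one T x H plane."""
    return 2 * h * w + t * w + t * h


if __name__ == "__main__":
    # Scale all three axes together and compare the two counts.
    for n in (8, 16, 32, 64):
        vol = volume_size(n, n, n)
        planes = four_plane_size(n, n, n)
        print(f"n={n:3d}  volume={vol:8d}  four-plane={planes:6d}  "
              f"ratio={vol / planes:.1f}")
```

Under this assumed layout with T = H = W = n, the volume has n^3 positions while the four planes have 4n^2, so the plane count scales as the 2/3 power of the volume size, which is one way a factorized latent space can grow sublinearly with the input.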