Recent advancements in video autoencoders (Video AEs) have significantly improved the quality and efficiency of video generation. In this paper, we propose a novel and compact video autoencoder, VidTwin, that decouples video into two distinct latent spaces: Structure latent vectors, which capture overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements. Specifically, our approach builds on an Encoder-Decoder backbone, augmented with two submodules that extract these latent spaces. The first submodule employs a Q-Former to extract low-frequency motion trends, followed by downsampling blocks that remove redundant content details. The second averages the latent vectors along the spatial dimension to capture rapid motion. Extensive experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset), and performs efficiently and effectively in downstream generative tasks. Moreover, our model demonstrates explainability and scalability, paving the way for future research in video latent representation and generation. Our code has been released at https://github.com/microsoft/VidTok/tree/main/vidtwin.
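The two-branch latent split described above can be sketched as follows. This is a minimal illustration under assumed tensor shapes, not the released implementation: the Structure branch is approximated here by simple spatial mean-pooling (the paper's Q-Former and downsampling blocks are omitted), while the Dynamics branch follows the abstract's description of averaging over the spatial dimensions.

```python
import numpy as np

def split_latents(z, pool=4):
    """Hypothetical sketch of VidTwin's two-branch latent split.

    z: encoder latent of assumed shape (T, H, W, C).
    Structure branch: coarse spatial pooling keeps low-frequency
    content and global movement (stands in for the Q-Former plus
    downsampling blocks of the actual model).
    Dynamics branch: averaging over all spatial positions leaves one
    compact vector per frame, tracking rapid frame-to-frame motion.
    """
    T, H, W, C = z.shape
    # Structure: non-overlapping pool x pool spatial mean-pooling
    zs = z.reshape(T, H // pool, pool, W // pool, pool, C).mean(axis=(2, 4))
    # Dynamics: collapse the spatial dims to a per-frame vector
    zd = z.mean(axis=(1, 2))
    return zs, zd

z = np.random.randn(8, 16, 16, 32)   # (T, H, W, C), shapes are assumptions
zs, zd = split_latents(z)
# zs has shape (8, 4, 4, 32); zd has shape (8, 32)
```

The asymmetry mirrors the paper's design: the Structure latent retains a (reduced) spatial grid, while the Dynamics latent discards spatial layout entirely, which is what makes the combined representation so compact.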