Recent advances in video autoencoders (Video AEs) have significantly improved the quality and efficiency of video generation. In this paper, we propose a novel and compact video autoencoder, VidTwin, that decouples video into two distinct latent spaces: Structure latent vectors, which capture overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements. Specifically, our approach builds on an Encoder-Decoder backbone augmented with two submodules, one for each latent space. The first submodule employs a Q-Former to extract low-frequency motion trends, followed by downsampling blocks that remove redundant content details; the second averages the latent vectors along the spatial dimension to capture rapid motion. Extensive experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality (a PSNR of 28.14 on the MCL-JCV dataset), and performs efficiently and effectively on downstream generative tasks. Moreover, our model demonstrates explainability and scalability, paving the way for future research in video latent representation and generation. Check our project page for more details: https://vidtwin.github.io/.
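The decoupling described above can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the paper's implementation: the Dynamics branch spatially averages per-frame features, as the abstract states, while the Structure branch is stood in for by plain strided downsampling (the actual model uses a Q-Former followed by downsampling blocks). The function names, shapes, and stride are illustrative assumptions.

```python
import numpy as np

def extract_dynamics(latent):
    """Dynamics latent: average per-frame features over the spatial
    dimensions, yielding one compact vector per frame (T, C) that is
    intended to carry rapid, fine-grained motion cues."""
    # latent has shape (T, H, W, C): frames x height x width x channels
    return latent.mean(axis=(1, 2))

def extract_structure(latent, stride=2):
    """Structure latent: reduce spatial resolution to discard redundant
    content detail. A simple strided slice stands in for the paper's
    Q-Former + downsampling blocks (illustrative only)."""
    return latent[:, ::stride, ::stride, :]

# Toy encoder output: 16 frames, an 8x8 spatial grid, 32 channels.
rng = np.random.default_rng(0)
z = rng.standard_normal((16, 8, 8, 32))

z_dyn = extract_dynamics(z)      # shape (16, 32)
z_struct = extract_structure(z)  # shape (16, 4, 4, 32)
```

Together the two branches keep far fewer values than the full latent grid, which is the source of the compression the abstract reports; the decoder would then fuse both latents to reconstruct the video.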