Existing video tokenizers typically use the traditional Variational Autoencoder (VAE) architecture for video compression and reconstruction. However, to achieve good performance, their training often relies on complex multi-stage tricks that go beyond the basic reconstruction loss and KL regularization. The most challenging of these is the precise tuning of adversarial training with an additional Generative Adversarial Network (GAN) in the final stage, which can hinder stable convergence. In contrast to GANs, diffusion models offer a more stable training process and can generate higher-quality results. Inspired by these advantages, we propose CDT, a novel Conditioned Diffusion-based video Tokenizer that replaces the GAN-based decoder with a conditional causal diffusion model. The encoder compresses spatio-temporal information into compact latents, while the decoder reconstructs videos through a reverse diffusion process conditioned on these latents. During inference, we incorporate a feature-cache mechanism to generate videos of arbitrary length while maintaining temporal continuity, and adopt a sampling acceleration technique to improve efficiency. Trained from scratch with only a basic MSE diffusion loss for reconstruction, a KL term, and an LPIPS perceptual loss, CDT achieves state-of-the-art performance on video reconstruction tasks with just a single sampling step, as extensive experiments demonstrate. Even a scaled-down version of CDT (3$\times$ inference speedup) still performs comparably to top baselines. Moreover, a latent video generation model trained with CDT also exhibits superior performance. The source code and pretrained weights are available at https://github.com/ali-vilab/CDT.
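The encode-then-conditionally-denoise flow described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the dimensions, the linear stand-ins for the trained encoder and diffusion decoder networks, and all variable names are hypothetical; it only shows the data flow of single-step sampling, where the decoder maps pure noise plus the conditioning latent directly to a reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: a "video" flattened into a vector,
# compressed 8x into a compact latent.
VIDEO_DIM, LATENT_DIM = 64, 8

# Untrained linear stand-ins for the encoder and the conditional
# diffusion decoder (real networks would be causal spatio-temporal models).
W_enc = rng.normal(size=(LATENT_DIM, VIDEO_DIM)) / np.sqrt(VIDEO_DIM)
W_dec = rng.normal(size=(VIDEO_DIM, VIDEO_DIM + LATENT_DIM)) / np.sqrt(VIDEO_DIM + LATENT_DIM)

def encode(video):
    """Compress spatio-temporal input into a compact latent."""
    return W_enc @ video

def denoise_step(noisy, latent):
    """One reverse-diffusion step conditioned on the latent:
    with single-step sampling this maps noise straight to the output."""
    return W_dec @ np.concatenate([noisy, latent])

video = rng.normal(size=VIDEO_DIM)
z = encode(video)                  # compact latent (the "token" representation)
x_T = rng.normal(size=VIDEO_DIM)   # start the reverse process from Gaussian noise
recon = denoise_step(x_T, z)       # single denoising step -> reconstructed video
print(recon.shape)                 # (64,)
```

In the actual model the decoder would be trained with an MSE diffusion objective so that this conditional denoising recovers the input video; the sketch only makes the conditioning and single-step data flow concrete.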