Video tokenizers, which transform videos into compact latent representations, are key to video generation. Existing video tokenizers are built on the VAE architecture and follow a paradigm in which an encoder compresses videos into compact latents and a deterministic decoder reconstructs the original videos from these latents. In this paper, we propose a novel \underline{\textbf{C}}onditioned \underline{\textbf{D}}iffusion-based video \underline{\textbf{T}}okenizer, termed \textbf{\ourmethod}, which departs from previous methods by replacing the deterministic decoder with a 3D causal diffusion model. The reverse diffusion generative process of the decoder is conditioned on the latent representations produced by the encoder. With feature caching and sampling acceleration, the framework efficiently reconstructs high-fidelity videos of arbitrary length. Results show that {\ourmethod} achieves state-of-the-art performance on video reconstruction tasks with just a single sampling step. Even a smaller version of {\ourmethod} achieves reconstruction results on par with the top two baselines. Furthermore, a latent video generation model trained with {\ourmethod} also shows superior performance.
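As a brief formal sketch (the notation here is assumed for illustration, not drawn from the paper body): letting $\mathcal{E}$ denote the encoder and $z = \mathcal{E}(x_0)$ the latent of a video $x_0$, the decoder can be read as a conditional denoising model that learns the conditioned reverse diffusion
\[
p_\theta(x_{t-1} \mid x_t, z), \qquad z = \mathcal{E}(x_0),
\]
so that sampling from $p_\theta$ (in {\ourmethod}, with a single step) reconstructs $x_0$ from its latent $z$.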