Encoding video content into compact latent tokens has become a fundamental step in video generation and understanding, driven by the need to address the inherent redundancy in pixel-level representations. Consequently, as video-centric research gains prominence, there is a growing demand for high-performance, open-source video tokenizers. We introduce VidTok, a versatile video tokenizer that delivers state-of-the-art performance in both continuous and discrete tokenization. VidTok incorporates several key advancements over existing approaches: 1) an improved model architecture, including redesigned convolutional layers and up/downsampling modules; 2) the integration of Finite Scalar Quantization (FSQ) into discrete video tokenization, which addresses the training instability and codebook collapse commonly associated with conventional Vector Quantization (VQ); and 3) improved training strategies, including a two-stage training process and the use of reduced frame rates. By combining these advancements, VidTok achieves substantial improvements over existing methods, demonstrating superior performance across multiple metrics, including PSNR, SSIM, LPIPS, and FVD, under standardized evaluation settings.
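To make the FSQ idea concrete: unlike VQ, which looks up nearest neighbors in a learned codebook, FSQ simply bounds each latent channel and rounds it to a small fixed set of values, so the "codebook" is implicit (the product of per-channel level counts) and there is nothing to collapse. The sketch below is a minimal NumPy illustration of this general mechanism, not VidTok's actual implementation; it uses odd per-channel level counts for simplicity (even counts require a half-step offset, omitted here), and the function name and level choices are illustrative assumptions.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Minimal Finite Scalar Quantization sketch (illustrative, not VidTok's code).

    Each latent channel i is squashed with tanh into [-half_i, half_i],
    then rounded to an integer grid and rescaled to [-1, 1]. The implicit
    codebook size is prod(levels); no codebook is learned, so there are
    no commitment losses and no codebook collapse. Assumes odd `levels`.
    """
    levels = np.asarray(levels, dtype=np.float64)
    half = (levels - 1) / 2.0            # e.g. 5 levels -> integers in {-2,...,2}
    bounded = np.tanh(z) * half          # bound each channel to (-half, half)
    return np.round(bounded) / half      # snap to grid, rescale to [-1, 1]

# 4 latent vectors with 5 channels; implicit codebook size 7*7*5*5*5 = 6125.
z = np.random.randn(4, 5)
zq = fsq_quantize(z, [7, 7, 5, 5, 5])
# In training, gradients would flow via a straight-through estimator,
# e.g. z + stop_gradient(zq - z), as in the FSQ literature.
```

Because rounding has zero gradient almost everywhere, training uses a straight-through estimator in practice; the forward pass above is the entire quantizer.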