This work presents VTok, a unified video tokenization framework for both generation and understanding tasks. Unlike leading vision-language systems, which tokenize videos through a naive frame-sampling strategy, we decouple the spatial and temporal representations of a video: we retain the spatial features of a single key frame and encode each subsequent frame into a single residual token, yielding a compact yet expressive video tokenization. By design, VTok reduces the token count per video from the product of the frame count and the per-frame token count to their sum, and our experiments show that the residual tokens sufficiently capture viewpoint and motion changes relative to the key frame. Extensive evaluations demonstrate the efficacy and efficiency of VTok: it achieves notably higher performance than baselines using naive tokenization on a range of video understanding and text-to-video generation benchmarks, all with shorter token sequences per video (e.g., 3.4% higher accuracy on our TV-Align benchmark and a 1.9% higher VBench score). Notably, VTok produces more coherent motion and stronger prompt following in text-to-video generation, owing to its more consistent temporal encoding. We hope VTok can serve as a standardized video tokenization paradigm for future research in video understanding and generation.
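The product-to-sum token arithmetic can be made concrete with a minimal sketch. The module below is purely illustrative: the encoder choices (a patchifying convolution for spatial tokens, a pooled projection of frame differences for residual tokens) and all names are hypothetical stand-ins, not the paper's actual architecture; only the token-count structure (N spatial tokens for the key frame plus one residual token per subsequent frame) follows the description above.

```python
import torch
import torch.nn as nn

class VTokSketch(nn.Module):
    """Illustrative sketch of decoupled video tokenization.

    A video of T frames maps to N spatial tokens from a single key
    frame plus one residual token for each of the remaining T-1
    frames: N + (T - 1) tokens total, versus N * T tokens for naive
    per-frame tokenization. Both encoders are hypothetical stand-ins.
    """

    def __init__(self, dim: int = 256, patch: int = 28):
        super().__init__()
        # Stand-in: patchify the key frame into N spatial tokens.
        self.spatial_encoder = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Stand-in: compress each (frame - key frame) difference to one token.
        self.residual_encoder = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, dim)
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (T, 3, H, W); frame 0 serves as the key frame.
        key, rest = video[0], video[1:]
        spatial = self.spatial_encoder(key[None])        # (1, dim, h, w)
        spatial = spatial.flatten(2).transpose(1, 2)[0]  # (N, dim), N = h * w
        residual = self.residual_encoder(rest - key)     # (T - 1, dim)
        return torch.cat([spatial, residual], dim=0)     # (N + T - 1, dim)

video = torch.randn(16, 3, 224, 224)  # 16 frames of 224 x 224 RGB
tokens = VTokSketch()(video)
print(tokens.shape)  # torch.Size([79, 256]): 64 spatial + 15 residual tokens
```

Under this scheme, a 16-frame clip with 64 tokens per frame costs 64 + 15 = 79 tokens instead of the 64 x 16 = 1024 tokens of naive per-frame tokenization.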