Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language-aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly up to 4K and 8K resolutions.
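To make the core idea concrete, the following is a minimal NumPy sketch of shared-codebook binary quantization applied to a multi-scale feature pyramid. It is an illustration under our own simplifying assumptions, not the actual LaPQ module: here binary codes come from the sign of a single shared linear projection (an LFQ-style scheme with an implicit codebook of size 2^n_bits), and all function names, shapes, and the toy three-level pyramid are hypothetical.

```python
import numpy as np

def binary_quantize(features, proj):
    """Quantize feature vectors to binary codes via a shared projection.

    Each feature vector is projected to n_bits logits; the sign of each
    logit gives one bit, so the effective vocabulary is 2**n_bits without
    storing an explicit codebook table (an LFQ-style construction).
    """
    logits = features @ proj                  # (N, n_bits)
    bits = (logits > 0).astype(np.int64)      # one binary code per token
    # pack the bit vector into a single integer token index
    indices = bits @ (1 << np.arange(bits.shape[1]))
    return indices

def pyramidal_tokenize(pyramid, proj):
    """Apply the same shared binary quantizer at every pyramid depth."""
    return [binary_quantize(f.reshape(-1, f.shape[-1]), proj)
            for f in pyramid]

rng = np.random.default_rng(0)
# shared projection: 16-dim features -> 12 bits, i.e. a 4096-entry codebook
proj = rng.standard_normal((16, 12))
# toy 3-level (T, H, W, C) feature pyramid at decreasing spatial resolution
pyramid = [rng.standard_normal((2, 8, 8, 16)),
           rng.standard_normal((2, 4, 4, 16)),
           rng.standard_normal((2, 2, 2, 16))]
tokens = pyramidal_tokenize(pyramid, proj)
print([t.shape[0] for t in tokens])  # token counts per scale: [128, 32, 8]
```

Because every scale shares one quantizer, coarse and fine tokens live in the same vocabulary, which is what allows a single autoregressive objective to be defined over the whole hierarchy.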