Video tokenization is fundamental to scalable video generation, as the number of tokens directly determines the computational cost and the length of videos that can be modeled. Existing tokenizers mainly improve scalability by compressing videos into fewer tokens, but they often continue to represent persistent content, such as static backgrounds and consistent object appearances, repeatedly across frames and chunks. In this paper, we propose \textbf{TivTok} (\textit{Time-Invariant Tokenizer}), a reuse-aware video tokenizer that makes persistent information reusable across time. TivTok represents a clip with Time-Invariant (TIV) tokens that encode information shared across frames and Time-Variant (TV) tokens that encode frame-specific residuals. To obtain this factorization, we introduce Scope-Induced Factorization (SIF), which assigns different attention scopes to the two token groups: TIV tokens attend to the full clip, whereas each TV token only accesses its corresponding frame together with the TIV tokens. In the decoder, Invariant Broadcasting (IB) reuses the same TIV tokens across frames and chunks for parallel reconstruction and long-video tokenization. Experiments show that TivTok achieves an rFVD of 12.65 on the standard $16{\times}256{\times}256$ benchmark and improves compression efficiency by 2.91$\times$ for 128-frame videos compared with the evaluated baselines, while using only 1.1\% of the tokens required by downsample-based tokenizers in our evaluation.
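The attention scopes that SIF assigns can be illustrated with a small mask-construction sketch. This is not the paper's implementation. The function name `sif_attention_mask`, its parameters, the token ordering, and the choice to let TV tokens of the same frame see each other are all assumptions made for illustration; the abstract only states that TIV tokens attend to the full clip and that each TV token accesses its own frame together with the TIV tokens.

```python
def sif_attention_mask(num_frames, patches_per_frame, num_tiv, tv_per_frame):
    """Return mask[q][k] == True where latent query q may attend to key k.

    Hypothetical layout (an assumption, not taken from the paper):
      keys:    [frame patches | TIV tokens | TV tokens]
      queries: [TIV tokens | TV tokens]
    """
    num_patches = num_frames * patches_per_frame
    tiv_start = num_patches                # first TIV key index
    tv_start = tiv_start + num_tiv         # first TV key index
    num_keys = tv_start + num_frames * tv_per_frame

    mask = []

    # TIV queries: full-clip scope, i.e. every patch of every frame,
    # plus the TIV group itself. They do not see TV tokens (assumption).
    for _ in range(num_tiv):
        row = [False] * num_keys
        for k in range(tv_start):
            row[k] = True
        mask.append(row)

    # TV queries: frame-local scope plus the shared TIV tokens.
    for f in range(num_frames):
        for _ in range(tv_per_frame):
            row = [False] * num_keys
            # Patches of the corresponding frame only.
            for k in range(f * patches_per_frame, (f + 1) * patches_per_frame):
                row[k] = True
            # All TIV tokens.
            for k in range(tiv_start, tv_start):
                row[k] = True
            # TV tokens of the same frame (assumption).
            for k in range(tv_start + f * tv_per_frame,
                           tv_start + (f + 1) * tv_per_frame):
                row[k] = True
            mask.append(row)

    return mask
```

Under this sketch, a TV token never sees another frame's patches directly, so any information it needs from elsewhere in the clip has to come through the TIV tokens. That restriction is what would push clip-wide content into the TIV group and leave frame-specific residuals to the TV group.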