We present LARP, a novel video tokenizer designed to overcome the limitations of current video tokenization methods for autoregressive (AR) generative models. Unlike traditional patchwise tokenizers that directly encode local visual patches into discrete tokens, LARP introduces a holistic tokenization scheme that gathers information from the visual content using a set of learned holistic queries. This design allows LARP to capture more global and semantic representations, rather than being limited to local patch-level information. Furthermore, it offers flexibility by supporting an arbitrary number of discrete tokens, enabling adaptive and efficient tokenization based on the specific requirements of the task. To align the discrete token space with downstream AR generation tasks, LARP integrates a lightweight AR transformer as a training-time prior model that predicts the next token in its discrete latent space. By incorporating the prior model during training, LARP learns a latent space that is not only optimized for video reconstruction but is also structured in a way that is more conducive to autoregressive generation. Moreover, this process defines a sequential order over the discrete tokens, progressively pushing them toward an optimal configuration during training and ensuring smoother, more accurate AR generation at inference time. Comprehensive experiments demonstrate LARP's strong performance, achieving state-of-the-art FVD on the UCF101 class-conditional video generation benchmark. LARP enhances the compatibility of AR models with video and opens up the potential to build unified high-fidelity multimodal large language models (MLLMs).
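The core idea — learned queries that cross-attend over all patch embeddings, with each query output then snapped to its nearest codebook entry — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: all dimensions, the single-head attention, and the plain nearest-neighbor quantizer are assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def holistic_tokenize(patches, queries, codebook):
    """Hypothetical sketch of holistic tokenization.

    patches:  (N, dim) patch embeddings of a video clip
    queries:  (Q, dim) learned holistic queries
    codebook: (K, dim) discrete codebook entries
    Returns (Q,) token ids and (Q, dim) quantized latents.
    """
    # Cross-attention: each learned query gathers information from ALL
    # patches, so a token can summarize global rather than local content.
    attn = softmax(queries @ patches.T / np.sqrt(patches.shape[1]))
    h = attn @ patches                                   # (Q, dim) summaries

    # Vector quantization: nearest codebook entry per query output.
    dists = ((h[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = dists.argmin(axis=1)                           # discrete token ids
    return ids, codebook[ids]

dim, num_patches, num_queries, K = 32, 196, 16, 256
patches = rng.normal(size=(num_patches, dim))
queries = rng.normal(size=(num_queries, dim))
codebook = rng.normal(size=(K, dim))
ids, z = holistic_tokenize(patches, queries, codebook)
# Note: the token count is set by num_queries, decoupled from the patch
# count -- this is what allows an arbitrary number of discrete tokens.
print(ids.shape, z.shape)
```

Because the number of tokens is fixed by the query set rather than the patch grid, shrinking or growing `num_queries` directly trades off compression against fidelity, which is the flexibility the abstract refers to.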