Videos are inherently temporal sequences. In this work, we explore the potential of modeling videos chronologically and scalably with autoregressive (AR) language models, inspired by their success in natural language processing. We introduce DiCoDe, a novel approach that leverages Diffusion-Compressed Deep Tokens to generate videos autoregressively with a language model. Unlike existing methods that employ low-level representations with limited compression rates, DiCoDe uses deep tokens with a substantial compression rate (a 1000x reduction in token count). This compression is made possible by a tokenizer trained to exploit the prior knowledge of video diffusion models. Deep tokens enable DiCoDe to employ vanilla AR language models for video generation, akin to translating one visual "language" into another. By treating videos as temporal sequences, DiCoDe fully harnesses the capabilities of language models for autoregressive generation. DiCoDe is scalable with readily available AR architectures and can generate videos ranging from a few seconds to one minute while training on only 4 A100 GPUs. We evaluate DiCoDe both quantitatively and qualitatively, demonstrating that it matches existing methods in quality while training efficiently. To showcase its scalability, we release a series of DiCoDe configurations with varying parameter counts and observe consistent performance gains as model size grows from 100M to 3B parameters. We believe DiCoDe's exploration represents a promising first step toward scalable video modeling with AR language models in academia, paving the way for larger and more powerful video generation models.
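The pipeline described above — compress a clip into a short sequence of continuous deep tokens, roll the sequence forward with a vanilla AR model one token at a time, then decode back to pixels — can be sketched as follows. This is a minimal conceptual sketch, not the DiCoDe implementation: every function here (`encode_clip`, `ar_predict_next`, `generate`) is a hypothetical placeholder, and trivial stand-ins replace the diffusion-based tokenizer and the trained language model.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_clip(frames: np.ndarray) -> np.ndarray:
    """Hypothetical tokenizer: compress a clip of frames into a handful of
    continuous 'deep tokens' (a stand-in for the diffusion-prior tokenizer,
    which achieves roughly a 1000x reduction in token count)."""
    # (T, H, W, C) -> (num_tokens, dim); here just 2 tokens of dim 8 per clip
    pooled = frames.reshape(frames.shape[0], -1).mean(axis=1, keepdims=True)
    return np.tile(pooled[:2], (1, 8))

def ar_predict_next(history: np.ndarray) -> np.ndarray:
    """Hypothetical AR language-model step: predict the next deep token
    conditioned on all previous ones (trivial running-mean placeholder)."""
    return history.mean(axis=0)

def generate(prompt_tokens: np.ndarray, n_new: int) -> np.ndarray:
    """Vanilla autoregressive rollout over deep tokens."""
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        tokens.append(ar_predict_next(np.stack(tokens)))  # one token at a time
    return np.stack(tokens)  # a diffusion decoder would map these to frames

clip = rng.normal(size=(8, 16, 16, 3))  # tiny dummy video clip
prompt = encode_clip(clip)              # (2, 8) deep tokens
out = generate(prompt, n_new=4)         # AR rollout of 4 more tokens
print(out.shape)                        # (6, 8)
```

Because the tokens are continuous rather than discrete codebook indices, a real system would train the AR model with a regression- or diffusion-style objective instead of a softmax over a vocabulary; the rollout loop itself is the same.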