Videos are inherently temporal sequences. In this work, we explore the potential of modeling videos chronologically and scalably with autoregressive (AR) language models, inspired by their success in natural language processing. We introduce DiCoDe, a novel approach that leverages Diffusion-Compressed Deep Tokens to generate videos autoregressively with a language model. Unlike existing methods that employ low-level representations with limited compression rates, DiCoDe uses deep tokens with a substantial compression rate (a 1000x reduction in token count). This compression is made possible by a tokenizer trained to exploit the prior knowledge of video diffusion models. Deep tokens enable DiCoDe to employ vanilla AR language models for video generation, akin to translating one visual "language" into another. By treating videos as temporal sequences, DiCoDe fully harnesses the capabilities of language models for autoregressive generation. DiCoDe is scalable with readily available AR architectures and can generate videos ranging from a few seconds to one minute while training on only 4 A100 GPUs. We evaluate DiCoDe both quantitatively and qualitatively, demonstrating that it matches existing methods in quality while training efficiently. To showcase its scalability, we release a series of DiCoDe configurations with varying parameter counts and observe consistent performance gains as model size grows from 100M to 3B parameters. We believe DiCoDe's exploration represents a promising first step toward scalable video modeling with AR language models in academia, paving the way for larger and more powerful video generation models.
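The pipeline described above — compress a clip into a short sequence of continuous deep tokens, roll the sequence forward with a vanilla AR model one token at a time, then decode back to pixels — can be sketched as follows. This is a minimal conceptual sketch, not the DiCoDe implementation: every function here (`encode_clip`, `ar_predict_next`, `generate`) is a hypothetical placeholder, and trivial stand-ins replace the diffusion-based tokenizer and the trained language model.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_clip(frames: np.ndarray) -> np.ndarray:
    """Hypothetical tokenizer: compress a clip of frames into a handful of
    continuous 'deep tokens' (a stand-in for the diffusion-prior tokenizer,
    which achieves roughly a 1000x reduction in token count)."""
    # (T, H, W, C) -> (num_tokens, dim); here just 2 tokens of dim 8 per clip
    pooled = frames.reshape(frames.shape[0], -1).mean(axis=1, keepdims=True)
    return np.tile(pooled[:2], (1, 8))

def ar_predict_next(history: np.ndarray) -> np.ndarray:
    """Hypothetical AR language-model step: predict the next deep token
    conditioned on all previous ones (trivial running-mean placeholder)."""
    return history.mean(axis=0)

def generate(prompt_tokens: np.ndarray, n_new: int) -> np.ndarray:
    """Vanilla autoregressive rollout over deep tokens."""
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        tokens.append(ar_predict_next(np.stack(tokens)))  # one token at a time
    return np.stack(tokens)  # a diffusion decoder would map these to frames

clip = rng.normal(size=(8, 16, 16, 3))  # tiny dummy video clip
prompt = encode_clip(clip)              # (2, 8) deep tokens
out = generate(prompt, n_new=4)         # AR rollout of 4 more tokens
print(out.shape)                        # (6, 8)
```

Because the tokens are continuous rather than discrete codebook indices, a real system would train the AR model with a regression- or diffusion-style objective instead of a softmax over a vocabulary; the rollout loop itself is the same.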