In light of recent advances in multimodal Large Language Models (LLMs), there is increasing attention to scaling them from image-text data to more informative real-world videos. Compared to static images, videos pose unique challenges for effective large-scale pre-training because their spatiotemporal dynamics must be modeled. In this paper, we address these limitations in video-language pre-training with an efficient video decomposition that represents each video as keyframes and temporal motions. These are then adapted to an LLM using well-designed tokenizers that discretize visual and temporal information into a few tokens, thus enabling unified generative pre-training of videos, images, and text. At inference, the tokens generated by the LLM are carefully recovered to the original continuous pixel space to create diverse video content. The proposed framework is capable of both comprehending and generating image and video content, as demonstrated by its competitive performance across 13 multimodal benchmarks spanning image and video understanding and generation. Our code and models are available at https://video-lavit.github.io.
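The keyframe-plus-motion decomposition described above can be sketched as follows. This is a minimal illustration only, not the paper's actual pipeline: it splits a video into fixed-length segments, keeps each segment's first frame as the keyframe, and uses frame residuals as a crude stand-in for the temporal motion signal (the segment length `segment_len` and the residual-based motion are assumptions made here for self-containedness).

```python
import numpy as np

def decompose_video(frames, segment_len=4):
    """Split a video of shape (T, H, W, C) into keyframes and motions.

    Each segment of `segment_len` frames contributes one keyframe (its
    first frame) and one motion tensor (the remaining frames' residuals
    with respect to that keyframe). Residuals are a simple stand-in for
    the motion information a real codec would provide.
    """
    keyframes, motions = [], []
    for start in range(0, len(frames), segment_len):
        seg = frames[start:start + segment_len]
        keyframes.append(seg[0])
        # residuals w.r.t. the keyframe approximate temporal motion
        motions.append(seg[1:] - seg[0])
    return keyframes, motions

# toy video: 8 frames of 4x4 grayscale, frame t filled with the value t
video = (np.arange(8, dtype=np.float32)[:, None, None, None]
         * np.ones((1, 4, 4, 1), dtype=np.float32))
keyframes, motions = decompose_video(video, segment_len=4)
```

In a full system each keyframe would then go through a visual tokenizer and each motion tensor through a temporal tokenizer, yielding the discrete token sequences the LLM is trained on.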