While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VVC) according to human evaluations, and (2) learning effective representations for action recognition tasks.