While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VVC) according to human evaluations, and (2) learning effective representations for action recognition tasks.