We present Open-MAGVIT2, a family of auto-regressive image generation models ranging from 300M to 1.5B parameters. The Open-MAGVIT2 project delivers an open-source replication of Google's MAGVIT-v2 tokenizer, a tokenizer with a super-large codebook (i.e., $2^{18}$ codes), and achieves state-of-the-art reconstruction performance (1.17 rFID) on ImageNet $256 \times 256$. Furthermore, we explore its application in plain auto-regressive models and validate its scalability. To assist auto-regressive models in predicting over a super-large vocabulary, we factorize it into two sub-vocabularies of different sizes via asymmetric token factorization, and further introduce "next sub-token prediction" to enhance sub-token interaction for better generation quality. We release all models and code to foster innovation and creativity in the field of auto-regressive visual generation.
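The asymmetric factorization above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the split sizes ($2^{6}$ and $2^{12}$) are assumed for demonstration only; the key property is that the product of the two sub-vocabulary sizes recovers the full $2^{18}$ codebook, so each super-vocabulary token maps losslessly to a pair of much smaller prediction targets.

```python
# Illustrative asymmetric token factorization: a code index in
# [0, 2**18) is split into two sub-tokens drawn from vocabularies
# of assumed sizes 2**6 = 64 and 2**12 = 4096 (2**6 * 2**12 = 2**18).

SUB_VOCAB_A = 2 ** 6    # smaller sub-vocabulary (assumed size)
SUB_VOCAB_B = 2 ** 12   # larger sub-vocabulary (assumed size)

def factorize(token: int) -> tuple[int, int]:
    """Split one super-vocabulary token into two sub-tokens."""
    assert 0 <= token < SUB_VOCAB_A * SUB_VOCAB_B
    return token // SUB_VOCAB_B, token % SUB_VOCAB_B

def recombine(sub_a: int, sub_b: int) -> int:
    """Invert the factorization back to the original token."""
    return sub_a * SUB_VOCAB_B + sub_b

token = 200_000
a, b = factorize(token)          # two smaller prediction targets
assert recombine(a, b) == token  # lossless round trip
```

An auto-regressive model can then predict the pair $(a, b)$ sequentially ("next sub-token prediction") instead of one softmax over all $2^{18}$ classes, which keeps each output head tractable.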