We present Open-MAGVIT2, a family of auto-regressive image generation models ranging from 300M to 1.5B parameters. The Open-MAGVIT2 project delivers an open-source replication of Google's MAGVIT-v2 tokenizer, a tokenizer with a super-large codebook (i.e., $2^{18}$ codes), and achieves state-of-the-art reconstruction performance (1.17 rFID) on ImageNet $256 \times 256$. Furthermore, we explore its application in plain auto-regressive models and validate their scalability properties. To assist auto-regressive models in predicting with a super-large vocabulary, we factorize it into two sub-vocabularies of different sizes via asymmetric token factorization, and further introduce "next sub-token prediction" to enhance sub-token interaction for better generation quality. We release all models and code to foster innovation and creativity in the field of auto-regressive visual generation.
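The asymmetric token factorization described above can be illustrated as splitting each index of the $2^{18}$ codebook into a pair of sub-tokens drawn from two smaller vocabularies. A minimal sketch, assuming a hypothetical split into sub-vocabularies of sizes $2^{6}$ and $2^{12}$ (the exact sizes are an assumption here, chosen only so that their product equals $2^{18}$):

```python
# Hypothetical sketch of asymmetric token factorization.
# A token index in the full 2**18 codebook (from the abstract) is split
# into a (coarse, fine) sub-token pair; the 2**6 / 2**12 split is an
# assumed example, not necessarily the split used by Open-MAGVIT2.
V = 2 ** 18                 # full codebook size
V1, V2 = 2 ** 6, 2 ** 12    # assumed asymmetric sub-vocabulary sizes, V1 * V2 == V

def factorize(token: int) -> tuple[int, int]:
    """Split a full-vocabulary token into (coarse, fine) sub-tokens."""
    return divmod(token, V2)

def defactorize(coarse: int, fine: int) -> int:
    """Recombine the two sub-tokens into the original token index."""
    return coarse * V2 + fine

token = 123_456
coarse, fine = factorize(token)
assert 0 <= coarse < V1 and 0 <= fine < V2
assert defactorize(coarse, fine) == token
```

Under such a split, the auto-regressive model never has to produce a softmax over all $2^{18}$ entries; it predicts the two sub-tokens instead, which is what "next sub-token prediction" then orders sequentially.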