The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space biased towards low-level information, leading to a foundational flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly into improved generation performance. We identify this as the ``pre-training scaling problem'' and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework that pioneers the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) VTP exhibits much better scaling properties, with generative performance scaling effectively with the compute, parameters, and data allocated to visual tokenizer pre-training. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2\% zero-shot accuracy and 0.36 rFID on ImageNet) and $4.1\times$ faster convergence on generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying standard DiT training specs, solely investing more FLOPs in pre-training VTP achieves a 65.8\% FID improvement in downstream generation, whereas a conventional autoencoder stagnates very early, at 1/10 of the FLOPs. Our pre-trained models are available at https://github.com/MiniMax-AI/VTP.
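The joint objective described above can be sketched as a weighted sum of the three losses. This is a minimal illustrative sketch, not the paper's implementation: the loss weights, the CLIP-style InfoNCE form of the contrastive term, and the feature-matching form of the self-supervised term are all assumptions for exposition.

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive loss (CLIP-style); matched
    pairs sit on the diagonal of the similarity matrix."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    labels = np.arange(len(logits))

    def ce(l):
        # numerically stable cross-entropy with diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (ce(logits) + ce(logits.T))

def joint_loss(pixels, recon, img_emb, txt_emb, stu_feat, tea_feat,
               w_con=1.0, w_ssl=1.0, w_rec=1.0):
    """Hypothetical combined objective: contrastive + self-supervised
    + reconstruction, with illustrative weights."""
    l_rec = np.mean((pixels - recon) ** 2)        # pixel reconstruction (MSE)
    l_con = info_nce(img_emb, txt_emb)            # image-text contrastive
    l_ssl = np.mean((stu_feat - tea_feat) ** 2)   # self-supervised feature matching
    return w_con * l_con + w_ssl * l_ssl + w_rec * l_rec
```

In practice each term pulls the latent space in a different direction: reconstruction preserves pixel detail, while the contrastive and self-supervised terms inject the high-level semantics the abstract argues generation depends on.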