Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on top of a frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains such as text rendering. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Moreover, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, whereas RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Finally, because both visual understanding and generation can operate in a shared representation space, a multimodal model can reason directly over generated latents, opening new possibilities for unified models.
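The one ImageNet-era design choice the abstract retains at scale is dimension-dependent noise scheduling. As a rough illustration of what such a schedule looks like, the sketch below shifts flow-matching timesteps toward higher noise as the latent dimensionality grows; the function name, the square-root scaling law, and the base dimension are illustrative assumptions for exposition, not details taken from the paper.

```python
import math

import torch


def dim_shifted_timesteps(
    t: torch.Tensor,
    latent_dim: int,
    base_dim: int = 4096,
) -> torch.Tensor:
    """Shift flow-matching timesteps for high-dimensional latents.

    Convention assumed here: t = 0 is clean data and t = 1 is pure noise,
    so alpha > 1 pushes training samples toward the high-noise end,
    compensating for the redundancy of a wide semantic latent space.

    Applies the timestep-shift rule t' = alpha * t / (1 + (alpha - 1) * t)
    (the form popularized by Stable Diffusion 3 for resolution shifts),
    with alpha = sqrt(latent_dim / base_dim). Both the sqrt law and the
    base_dim default are assumptions, not values reported in the paper.
    """
    alpha = math.sqrt(latent_dim / base_dim)
    return alpha * t / (1.0 + (alpha - 1.0) * t)


if __name__ == "__main__":
    t = torch.rand(8)  # uniformly sampled training timesteps
    # Hypothetical widths: a SigLIP-2-style RAE latent is far wider than a
    # typical VAE latent, so its shifted schedule concentrates more training
    # steps at high noise levels.
    print(dim_shifted_timesteps(t, latent_dim=12288))  # hypothetical RAE width
    print(dim_shifted_timesteps(t, latent_dim=256))    # hypothetical VAE width
```

Under this assumed rule, the same uniform timestep draw lands at noisier points of the schedule for the wider latent, which is one way a single hyperparameter can adapt the diffusion process to latent dimension without any architectural change.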