Improved Baselines with Representation Autoencoders

Representation Autoencoders (RAE) replace the traditional VAE with pretrained vision encoders. In this paper, we systematically investigate several design choices and report three insights that simplify and improve RAE. First, we study a generalized formulation in which the representation is defined as the sum of the last k encoder layers rather than the final layer alone. This simple change greatly improves reconstruction without encoder finetuning or specialized data (e.g., text, faces). Second, we examine the prevalent assumption that RAE (which uses a pretrained representation as the encoder) replaces representation alignment (REPA), which instead distills the same representation into intermediate layers. Through large-scale empirical analysis, we uncover a surprising finding: RAE and REPA work through complementary mechanisms, so the same representation can serve both as the encoder and as the target for intermediate diffusion layers. Finally, the original RAE struggles with classifier-free guidance (CFG) and requires training a second, weaker diffusion model for AutoGuidance (AG). We show that REPA itself can be viewed as x-prediction in the RAE latent space. By simply re-parameterizing the output of the DiT model, it can provide guidance for "free". Overall, RAEv2 converges more than 10x faster than the original RAE, achieving a state-of-the-art gFID of 1.06 in just 80 epochs on ImageNet-256. On FDr6, RAEv2 achieves a state-of-the-art 2.17 at just 80 epochs, compared to the previous best of 3.26 (800 epochs), without any post-training. These gains motivate EPFID@k (the number of epochs needed to reach an unguided gFID < k) as a measure of training efficiency. RAEv2 attains an EPFID@2 of 35 epochs, versus 177 for the original RAE. We also validate our approach on text-to-image generation and navigation world models, showing consistent improvements in both settings. The code is available at https://raev2.github.io.
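To make the EPFID@k definition concrete, the following is a minimal sketch of how it could be computed from a training log. It assumes the metric is read off at the first evaluated checkpoint whose unguided gFID falls strictly below k, with no interpolation between checkpoints; the function name `epfid_at_k`, the `(epoch, gFID)` log format, and the toy numbers are illustrative assumptions and are not taken from the RAEv2 code.

```python
from typing import Optional, Sequence, Tuple


def epfid_at_k(history: Sequence[Tuple[int, float]], k: float) -> Optional[int]:
    """Return the earliest evaluated epoch whose unguided gFID is below k.

    `history` holds (epoch, unguided_gfid) pairs in any order.
    Returns None if the threshold is never reached.
    Assumption: no interpolation between evaluated checkpoints.
    """
    for epoch, gfid in sorted(history):  # visit checkpoints in epoch order
        if gfid < k:  # strict inequality, matching "gFID < k"
            return epoch
    return None


# Made-up numbers for illustration only; these are not results from the paper.
toy_history = [(10, 6.0), (20, 3.5), (30, 2.4), (40, 1.9), (50, 1.7)]

print(epfid_at_k(toy_history, k=2.0))  # 40
print(epfid_at_k(toy_history, k=1.0))  # None
```

Under this reading, the value depends on how often checkpoints are evaluated, so comparisons between models are only meaningful when they share the same evaluation schedule.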
