Diffusion transformers achieve strong image synthesis but remain inefficient to train because their generative representations are only weakly aligned with discriminative ones. Representation alignment frameworks such as REPA improve convergence by aligning the denoiser's features for noisy inputs with those of a pretrained visual encoder, but their externally supervised alignment loss is static and lacks adaptivity during training and inference. Existing methods rely on fixed cosine alignment or contrastive objectives, which cannot dynamically balance representation consistency against generation quality. As a result, they yield limited discriminative benefit and cannot optimize alignment in a task-adaptive manner. To address this, we propose VRPO, a reinforcement-based optimization strategy that replaces REPA's static alignment loss with a generative representation policy optimization objective. Instead of enforcing a fixed similarity constraint, VRPO treats representation alignment as a reward-guided process: the model receives adaptive rewards based on generation fidelity, perceptual quality, and semantic coherence between the diffusion features and pretrained visual embeddings. This formulation lets the generator continuously refine its internal representations toward semantically meaningful directions while improving image quality. VRPO-driven training integrates directly into diffusion transformers, adds negligible computational cost, and remains fully compatible with SiT and DiT architectures. Extensive experiments on ImageNet 256×256 show that VRPO-Alignment substantially improves both convergence and fidelity, achieving an FID improvement of up to 1.8 points and 2.3× faster training than REPA under identical compute budgets.
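To make the contrast between a static alignment loss and a reward-guided objective concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation. The abstract does not specify how the three reward terms are combined or which policy-gradient estimator is used, so the weighted sum in `combined_reward`, the batch normalization in `normalized_advantages`, and all function names and default weights are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon avoids division by zero


def static_alignment_loss(diffusion_feats, encoder_feats):
    """Fixed cosine alignment in the spirit of REPA: the negative mean
    similarity between diffusion features and pretrained encoder features.
    The objective is the same at every training step."""
    sims = [cosine_similarity(h, z) for h, z in zip(diffusion_feats, encoder_feats)]
    return -sum(sims) / len(sims)


def combined_reward(fidelity, perceptual, semantic, weights=(1.0, 1.0, 1.0)):
    """Hypothetical scalar reward: a weighted sum of the three signals named
    in the abstract. The weighting scheme is an assumption."""
    w_f, w_p, w_s = weights
    return w_f * fidelity + w_p * perceptual + w_s * semantic


def normalized_advantages(rewards):
    """Assumed estimator: center and scale rewards across a batch of samples,
    so better-than-average samples receive positive weight."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

Under these assumptions, the semantic term of the reward could be the same cosine similarity used by the static loss. The difference is that it would enter training through a per-sample advantage that also reflects fidelity and perceptual quality, rather than as a fixed penalty.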