Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details, even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73$\rightarrow$0.90) and DPGBench (80.93$\rightarrow$88.15), while also boosting editing benchmarks (ImgEdit 3.38$\rightarrow$3.75, GEdit 6.94$\rightarrow$7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.
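To make the procedure concrete, below is a minimal sketch of one RecA post-training step. It assumes a hypothetical interface in which `understanding_encoder` produces the understanding-side embeddings and `umm.generate_from_embeddings` is the generation pathway conditioned on them; the pixel L2 loss is a simplification, since the actual objective depends on the UMM's generative head (e.g., token cross-entropy or a diffusion loss). This is not the authors' released code.

```python
# Minimal sketch of one RecA update (assumed interfaces, not the official implementation).
import torch
import torch.nn.functional as F


def reca_step(umm, understanding_encoder, images, optimizer):
    """One self-supervised reconstruction-alignment update on a batch of images."""
    # 1) Dense semantic embeddings from the *understanding* side act as the
    #    "dense text prompt" that replaces a sparse caption.
    with torch.no_grad():
        cond = understanding_encoder(images)        # [B, N, D] embedding tokens

    # 2) Condition the generation side of the UMM on those embeddings and
    #    ask it to reconstruct the very same input images.
    recon = umm.generate_from_embeddings(cond)      # [B, C, H, W]

    # 3) Self-supervised reconstruction loss (pixel L2 here for simplicity).
    loss = F.mse_loss(recon, images)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```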