Unified Multimodal Models (UMMs) integrate visual understanding and generation within a single framework. The ultimate goal of this paradigm is a virtuous cycle in which understanding and generation mutually reinforce each other. While recent post-training methods have successfully leveraged understanding to enhance generation, the reverse direction, using generation to improve understanding, remains largely unexplored. In this work, we propose UniMRG (Unified Multi-Representation Generation), a simple yet effective architecture-agnostic post-training method. UniMRG enhances the understanding capabilities of UMMs by incorporating auxiliary generation tasks. Specifically, we train UMMs to generate multiple intrinsic representations of input images, namely pixel (reconstruction), depth (geometry), and segmentation (structure), alongside standard visual understanding objectives. By generating these diverse representations, UMMs capture complementary information about appearance, spatial relations, and structural layout, and consequently develop a deeper, more comprehensive understanding of visual inputs. Extensive experiments across diverse UMM architectures demonstrate that our method notably enhances fine-grained perception, reduces hallucinations, and improves spatial understanding, while simultaneously boosting generation capabilities.
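To make the training setup concrete, the sketch below illustrates one plausible form of the multi-task objective described above: a standard understanding loss combined with weighted auxiliary losses for pixel reconstruction, depth, and segmentation, i.e. L = L_und + λ_pix·L_pix + λ_depth·L_depth + λ_seg·L_seg. All module names (`umm`, `generate_representation`), loss choices, weights, and target sources are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a UniMRG-style multi-task post-training objective.
# Everything here (model interface, loss forms, weights, target provenance)
# is an assumption for illustration only.
import torch
import torch.nn.functional as F

def unimrg_loss(umm, batch, w_pix=1.0, w_depth=1.0, w_seg=1.0):
    """Combine the standard visual understanding loss with auxiliary
    generation losses over three intrinsic image representations."""
    # Standard understanding objective: next-token cross-entropy on the
    # text answer, conditioned on the input image.
    out = umm(images=batch["image"], text=batch["question"])
    loss_und = F.cross_entropy(
        out.logits.flatten(0, 1),
        batch["answer_ids"].flatten(),
        ignore_index=-100,  # mask prompt / padding positions
    )

    # Auxiliary generation targets: the image itself (pixel), plus depth
    # and segmentation maps, e.g. produced by off-the-shelf expert models.
    pred_pix = umm.generate_representation(batch["image"], task="pixel")
    pred_depth = umm.generate_representation(batch["image"], task="depth")
    pred_seg = umm.generate_representation(batch["image"], task="segmentation")

    loss_pix = F.mse_loss(pred_pix, batch["image"])            # appearance
    loss_depth = F.l1_loss(pred_depth, batch["depth"])         # geometry
    loss_seg = F.cross_entropy(pred_seg, batch["seg_labels"])  # structure

    return loss_und + w_pix * loss_pix + w_depth * loss_depth + w_seg * loss_seg
```

In this reading, the auxiliary terms act as regularizers on the shared representation: the model is rewarded for encoding appearance, geometry, and structure, which the understanding head can then exploit.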