Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks in which generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark that categorizes generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks requiring varying degrees of implicit or explicit visual transformation. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent gains emerge in spatial intelligence, visual illusion, and multi-round reasoning subtasks, where improved spatial and shape perception, as well as multi-step intermediate image states, prove beneficial. 3) Tasks with similar reasoning structures and models sharing architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases across tasks, pretraining data, and model architectures. These findings highlight the need for more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.