Large vision-language models are moving toward unifying visual understanding and visual generation tasks. However, whether generation can enhance understanding remains under-explored at large data scale. In this work, we analyze a unified model with a concise architecture, UniHetero, under large-scale pretraining (>200M samples). Our key observations are: (1) Generation can improve understanding, but only if the model generates semantics, not pixels. (2) Generation exhibits a superior data-scaling trend and higher data utilization. (3) Autoregression on input embeddings is effective for capturing visual details.