Employing a single, unified model (UM) for both visual understanding (image-to-text: I2T) and visual generation (text-to-image: T2I) has opened a new direction in Visual Language Model (VLM) research. While UMs can also support broader unimodal tasks (e.g., text-to-text, image-to-image), we focus on the core cross-modal pair T2I and I2T. Existing evaluation benchmarks consider these capabilities in isolation: FID and GenEval for T2I, and benchmarks such as MME and MMBench for I2T. These isolated single-pass metrics do not reveal cross-consistency: neither whether a model that "understands" a concept can also "render" it, nor whether semantic meaning is preserved when cycling between image and text modalities. To address this, we introduce the Semantic Drift Protocol (SDP) for Unified Models, a cyclic evaluation protocol that alternates I2T and T2I over multiple generations to quantify semantic drift. We propose two metrics: (i) Mean Cumulative Drift (MCD), an embedding-based measure of overall semantic drift; and (ii) Multi-Generation GenEval (MGG), an object-level compliance score extending GenEval. To assess generalization beyond the COCO dataset, which is widely used in training, we create a new benchmark, Nocaps+Docci400, sampled from NoCaps and DOCCI, and evaluate seven recent models on it. SDP reveals substantial variation in cross-modal stability: some models, such as BAGEL, maintain semantic meaning over many alternations, whereas others, such as VILA-U, drift quickly despite strong single-pass scores. Our results highlight SDP as a necessary complement to standard I2T and T2I evaluations. Code is available at https://github.com/mollahsabbir/Semantic-Drift-in-Unified-Models
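The cyclic protocol and the MCD metric described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `t2i`, `i2t`, and `embed` are hypothetical stand-ins (a toy "image" string, a caption stub that drops one word per cycle to mimic drift, and a bag-of-characters embedding); a real setup would wrap a unified model's two directions and a proper text encoder, and MGG's object-level checks are not shown.

```python
import numpy as np

# Hypothetical stand-ins for a unified model's two directions and a text
# embedder. All names and behaviors here are illustrative assumptions.
def t2i(prompt: str) -> str:
    # Stub "image": wrap the prompt in an image token.
    return f"<image of: {prompt}>"

def i2t(image: str) -> str:
    # Stub caption: recover the text, dropping one word per pass to mimic drift.
    words = image[len("<image of: "):-1].split()
    return " ".join(words[:-1]) if len(words) > 1 else " ".join(words)

def embed(text: str) -> np.ndarray:
    # Toy bag-of-characters embedding; a real setup would use a text encoder.
    v = np.zeros(128)
    for ch in text:
        v[ord(ch) % 128] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def mean_cumulative_drift(prompt: str, generations: int = 5) -> float:
    """Alternate T2I and I2T for several generations and average each
    generation's embedding distance from the original prompt."""
    ref = embed(prompt)
    text, drifts = prompt, []
    for _ in range(generations):
        text = i2t(t2i(text))                          # one full T2I -> I2T cycle
        drifts.append(1.0 - float(ref @ embed(text)))  # cosine distance to start
    return float(np.mean(drifts))
```

With these stubs, a multi-word prompt accumulates nonzero drift as words are lost across cycles, while a one-word prompt stays fixed, mirroring the stable-versus-drifting behavior the protocol is designed to expose.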