We introduce UEval, a benchmark for evaluating unified models, i.e., models capable of generating both images and text. UEval comprises 1,000 expert-curated questions, drawn from 8 real-world tasks, that require both images and text in the model output. The questions cover a wide range of reasoning types, from step-by-step guides to textbook explanations. Evaluating open-ended multimodal generation is non-trivial, as simple LLM-as-a-judge methods can miss subtle errors. Unlike previous works that rely on multimodal Large Language Models (MLLMs) to rate image quality or text accuracy, UEval adopts a rubric-based scoring system: for each question, reference images and text answers are provided to an MLLM to generate an initial rubric consisting of multiple evaluation criteria, which human experts then refine and validate. In total, UEval contains 10,417 validated rubric criteria, enabling scalable and fine-grained automatic scoring. UEval is challenging for current unified models: GPT-5-Thinking scores only 66.4 out of 100, while the best open-source model reaches merely 49.1. We observe that reasoning models often outperform non-reasoning ones, and that transferring reasoning traces from a reasoning model to a non-reasoning model significantly narrows the gap. This suggests that reasoning may be important for tasks requiring complex multimodal understanding and generation.
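To make the rubric-based scoring concrete, the minimal Python sketch below shows how per-question rubric criteria might be represented and aggregated into a 0-100 score. The data structures, weights, and example criteria here are hypothetical illustrations, not the authors' actual pipeline; the judge's per-criterion verdicts are assumed to come from an MLLM judging the model's answer against the validated rubric.

```python
# Minimal sketch of rubric-based scoring in the style described above.
# `Criterion`, `Rubric`, the weights, and the example rubric are all
# hypothetical, assumed for illustration only.
from dataclasses import dataclass, field

@dataclass
class Criterion:
    description: str   # what the judge checks, e.g. image-text alignment at a step
    weight: float      # relative importance within the rubric

@dataclass
class Rubric:
    question: str
    criteria: list[Criterion] = field(default_factory=list)

def score_answer(rubric: Rubric, satisfied: list[bool]) -> float:
    """Weighted fraction of rubric criteria the answer satisfies, scaled to 0-100.
    `satisfied[i]` is the MLLM judge's verdict on rubric.criteria[i]."""
    total = sum(c.weight for c in rubric.criteria)
    earned = sum(c.weight for c, ok in zip(rubric.criteria, satisfied) if ok)
    return 100.0 * earned / total if total > 0 else 0.0

# Example: a 3-criterion rubric where the judged answer meets two criteria.
rubric = Rubric(
    question="Explain with images how to tie a bowline knot.",
    criteria=[
        Criterion("Text lists the steps in the correct order", 1.0),
        Criterion("Each step is paired with a matching image", 2.0),
        Criterion("Final image shows the completed knot", 1.0),
    ],
)
print(score_answer(rubric, [True, True, False]))  # -> 75.0
```

Per-criterion binary verdicts of this kind are what make the scoring scalable and fine-grained: a single overall judge rating collapses many distinct failure modes, whereas criterion-level checks localize them.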