Text-to-image (T2I) models have achieved remarkable success in generating high-fidelity images, but they often fail to handle complex spatial relationships, e.g., spatial perception, reasoning, and interaction. These critical aspects are largely overlooked by current benchmarks because of their short, information-sparse prompt designs. In this paper, we introduce SpatialGenEval, a new benchmark designed to systematically evaluate the spatial intelligence of T2I models, with two key contributions: (1) SpatialGenEval comprises 1,230 long, information-dense prompts spanning 25 real-world scenes. Each prompt integrates 10 spatial sub-domains and 10 corresponding multiple-choice question-answer pairs, ranging from object position and layout to occlusion and causality. Our extensive evaluation of 21 state-of-the-art models reveals that higher-order spatial reasoning remains a primary bottleneck. (2) To demonstrate that the utility of our information-dense design extends beyond evaluation, we also construct the SpatialT2I dataset, which contains 15,400 text-image pairs with rewritten prompts that ensure image consistency while preserving information density. Fine-tuning current foundation models (i.e., Stable Diffusion-XL, Uniworld-V1, OmniGen2) on SpatialT2I yields consistent performance gains (+4.2%, +5.7%, +4.4%) and more realistic spatial relations, highlighting a data-centric paradigm for achieving spatial intelligence in T2I models.