Benchmarking Spatial Relationships in Text-to-Image Generation

Spatial understanding is a fundamental aspect of computer vision and integral to human-level reasoning about images, making it an important component of grounded language understanding. While recent text-to-image synthesis (T2I) models have shown unprecedented improvements in photorealism, it is unclear whether they have reliable spatial understanding capabilities. We investigate the ability of T2I models to generate correct spatial relationships among objects and present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image. To benchmark existing models, we introduce a dataset, $\mathrm{SR}_{2D}$, that contains sentences describing two or more objects and the spatial relationships between them. We construct an automated evaluation pipeline to recognize objects and their spatial relationships, and employ it in a large-scale evaluation of T2I models. Our experiments reveal a surprising finding: although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them. Our analyses identify several biases and artifacts of T2I models, such as difficulty generating multiple objects, a bias towards generating the first object mentioned, spatially inconsistent outputs for equivalent relationships, and a correlation between object co-occurrence and spatial understanding capabilities. We conduct a human study showing that VISOR aligns with human judgement about spatial understanding. We offer the $\mathrm{SR}_{2D}$ dataset and the VISOR metric to the community in support of T2I reasoning research.
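The abstract does not say how the automated pipeline decides whether a spatial relationship holds. As a rough illustration only, the sketch below assumes that an object detector returns labelled bounding boxes and that a relationship is judged by comparing box centroids. The `Box` type, the four relation names, and the `score_image` helper are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A detected object (hypothetical format): label plus pixel coordinates."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self):
        # Centroid of the bounding box.
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def relation_holds(a, b, relation):
    """Check whether box `a` stands in `relation` to box `b`.

    Assumes image coordinates, where y grows downward, so "above"
    means a smaller y value. The relation names are illustrative.
    """
    ax, ay = a.center
    bx, by = b.center
    if relation == "left of":
        return ax < bx
    if relation == "right of":
        return ax > bx
    if relation == "above":
        return ay < by
    if relation == "below":
        return ay > by
    raise ValueError(f"unknown relation: {relation!r}")


def score_image(detections, obj_a, obj_b, relation):
    """Return (both_objects_present, relation_correct) for one image.

    If several boxes share a label, the last one listed is used;
    a real pipeline would need a more careful choice.
    """
    boxes = {d.label: d for d in detections}
    if obj_a not in boxes or obj_b not in boxes:
        return (False, False)
    return (True, relation_holds(boxes[obj_a], boxes[obj_b], relation))


# Example: detections for the prompt "a dog to the left of a bench".
detections = [
    Box("dog", 10, 50, 60, 100),
    Box("bench", 120, 40, 220, 110),
]
result = score_image(detections, "dog", "bench", "left of")
```

Separating "both objects present" from "relation correct" mirrors the two failure modes the abstract names: failing to generate multiple objects at all, and generating them in the wrong arrangement.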
