Benchmarking Spatial Relationships in Text-to-Image Generation

Spatial understanding is a fundamental aspect of computer vision and integral to human-level reasoning about images, making it an important component of grounded language understanding. While recent text-to-image synthesis (T2I) models have shown unprecedented improvements in photorealism, it is unclear whether they have reliable spatial understanding capabilities. We investigate the ability of T2I models to generate correct spatial relationships among objects and present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image. To benchmark existing models, we introduce a dataset, SR2D, that contains sentences describing two objects and the spatial relationship between them. We construct an automated evaluation pipeline to recognize objects and their spatial relationships, and employ it in a large-scale evaluation of T2I models. Our experiments reveal a surprising finding: although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them. Our analyses demonstrate several biases and artifacts of T2I models, such as difficulty generating multiple objects, a bias towards generating the first object mentioned, spatially inconsistent outputs for equivalent relationships, and a correlation between object co-occurrence and spatial understanding capabilities. We conduct a human study showing that VISOR aligns with human judgement about spatial understanding. We offer the SR2D dataset and the VISOR metric to the community in support of T2I reasoning research.
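To make the idea of an automated spatial check concrete, the following is a minimal sketch of how a generated image might be scored once an object detector has returned bounding boxes. It is not the VISOR definition, which this abstract does not spell out. The centroid comparison rule, the four relation names, and the helper names (`relation_holds`, `score_image`, `spatial_accuracy`) are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixel
# coordinates, with y growing downward as in most image libraries.
Box = Tuple[float, float, float, float]


def centroid(box: Box) -> Tuple[float, float]:
    """Return the center point of a bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def relation_holds(box_a: Box, box_b: Box, relation: str) -> bool:
    """Check 'A <relation> B' by comparing box centroids (an assumed rule)."""
    ax, ay = centroid(box_a)
    bx, by = centroid(box_b)
    if relation == "left of":
        return ax < bx
    if relation == "right of":
        return ax > bx
    if relation == "above":
        return ay < by  # smaller y means higher in the image
    if relation == "below":
        return ay > by
    raise ValueError(f"unsupported relation: {relation}")


def score_image(detections: Dict[str, Box], obj_a: str, obj_b: str,
                relation: str) -> int:
    """Return 1 only if both objects are detected and the relation holds."""
    if obj_a not in detections or obj_b not in detections:
        return 0  # a missing object counts as a failure
    return int(relation_holds(detections[obj_a], detections[obj_b], relation))


def spatial_accuracy(
        samples: List[Tuple[Dict[str, Box], str, str, str]]) -> float:
    """Average the per-image scores over a list of samples."""
    if not samples:
        return 0.0
    return sum(score_image(*sample) for sample in samples) / len(samples)
```

Under this sketch, a prompt such as "a dog to the left of a chair" scores 1 only when both objects are detected and the dog's centroid lies to the left of the chair's. Scoring a missing object as a failure reflects the abstract's observation that models often fail to generate both objects at all.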
