Spatial reasoning plays a vital role in both human cognition and machine intelligence, prompting new research into the spatial reasoning capabilities of language models (LMs). However, existing benchmarks fall short in evaluating qualitative spatial reasoning (QSR): they typically present oversimplified scenarios or unclear natural language descriptions, which hinders effective evaluation. We present a novel benchmark for assessing QSR in LMs. It is grounded in realistic 3D simulation data and offers a series of diverse room layouts containing various objects and their spatial relationships. This approach provides a more detailed and context-rich narrative for spatial reasoning evaluation, in contrast to traditional, toy-task-oriented scenarios. Our benchmark covers a broad spectrum of qualitative spatial relationships, including topological, directional, and distance relations. These are presented from different viewpoints, at varied granularities, and with varying densities of relation constraints to mimic real-world complexity. A key contribution is our logic-based consistency-checking tool, which enables the assessment of multiple plausible solutions, in line with real-world scenarios where spatial relationships are often open to interpretation. Our evaluation of advanced LMs on this benchmark reveals their strengths and limitations in spatial reasoning. In particular, they struggle with multi-hop spatial reasoning and with interpreting descriptions that mix different viewpoints, pointing to areas for future improvement.
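To illustrate how a logic-based consistency check can accept more than one plausible answer, the following is a minimal sketch and not the tool described above. It assumes a hypothetical vocabulary of four directional relations (`left_of`, `right_of`, `in_front_of`, `behind`) over objects placed on a small grid, and it treats an answer as plausible if at least one layout satisfies the stated constraints together with that answer. All names, including `is_plausible`, are illustrative assumptions.

```python
from itertools import permutations, product

# Hypothetical relation vocabulary (an assumption for illustration only).
# Each object is a grid cell (x, y); smaller y is taken to mean "in front".
RELATIONS = {
    "left_of": lambda a, b: a[0] < b[0],
    "right_of": lambda a, b: a[0] > b[0],
    "in_front_of": lambda a, b: a[1] < b[1],
    "behind": lambda a, b: a[1] > b[1],
}


def satisfies(layout, constraints):
    """Return True if every (a, relation, b) triple holds in the layout."""
    return all(RELATIONS[rel](layout[a], layout[b]) for a, rel, b in constraints)


def consistent_layouts(objects, constraints):
    """Yield every placement of the objects that satisfies the constraints.

    An n-by-n grid is enough for n objects here, because all four relations
    are strict orderings along a single axis.
    """
    size = len(objects)
    cells = list(product(range(size), repeat=2))
    for placement in permutations(cells, len(objects)):
        layout = dict(zip(objects, placement))
        if satisfies(layout, constraints):
            yield layout


def is_plausible(answer, constraints, objects):
    """An answer is plausible if some layout satisfies constraints + answer."""
    combined = list(constraints) + [answer]
    return next(consistent_layouts(objects, combined), None) is not None


objects = ["sofa", "table", "lamp"]
constraints = [("sofa", "left_of", "table"), ("table", "left_of", "lamp")]

# Entailed by a two-hop chain, so it is accepted.
print(is_plausible(("sofa", "left_of", "lamp"), constraints, objects))
# Contradicts the chain, so it is rejected.
print(is_plausible(("sofa", "right_of", "lamp"), constraints, objects))
# Front/back order is unconstrained, so both answers are accepted.
print(is_plausible(("sofa", "behind", "lamp"), constraints, objects))
print(is_plausible(("sofa", "in_front_of", "lamp"), constraints, objects))
```

The brute-force search is only practical for a handful of objects. It is used here because it makes the acceptance criterion explicit: an answer is scored against the set of layouts consistent with the description, not against a single reference answer.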