We introduce FloorplanQA, a diagnostic benchmark for evaluating spatial reasoning in large language models (LLMs). FloorplanQA is grounded in structured representations of indoor scenes (e.g., kitchens, living rooms, bedrooms, and bathrooms), encoded symbolically as JSON or XML layouts. The benchmark covers core spatial tasks, including distance measurement, visibility, path finding, and object placement within constrained spaces. Our results across a variety of frontier open-source and commercial LLMs reveal that, while models may succeed on shallow queries, they often fail to respect physical constraints and preserve spatial coherence, though they remain mostly robust to small spatial perturbations. FloorplanQA uncovers a blind spot in today's LLMs: inconsistent reasoning about indoor layouts. We hope this benchmark inspires new work on language models that can accurately infer and manipulate spatial and geometric properties in practical settings.