Multimodal large language models (MLLMs) achieve strong performance on single-view spatial reasoning tasks, yet it remains unclear whether they maintain stable spatial state representations under counterfactual viewpoint changes. We introduce a controlled diagnostic benchmark that evaluates relational consistency under hypothetical camera orbit transformations without re-rendering images. Across 100 synthetic scenes and 6,000 relational queries, we measure viewpoint consistency, 360° cycle agreement, and relational stability over sequential transformations. Despite high single-view accuracy, state-of-the-art MLLMs exhibit systematic degradation under counterfactual viewpoint changes, with frequent violations of cycle consistency and rapid decay in relational stability. We further evaluate three input representations (visual input, textual bounding boxes, and structured scene graphs) and show that increasing representational structure improves stability. Our results suggest that single-view spatial accuracy overestimates the robustness of induced spatial representations and that representation structure plays a critical role in counterfactual spatial reasoning.
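To make the evaluated quantities concrete, the following is a minimal sketch of how ground-truth relations under a hypothetical camera orbit could be computed and compared against a model's answers. It is not the benchmark's implementation: the ground-plane coordinates, the orthographic comparison along the camera axes, and the names `camera_frame`, `relations`, `viewpoint_consistency`, and `stale_model` are all assumptions introduced for illustration.

```python
import math

def camera_frame(point, theta_deg):
    """Coordinates of a ground-plane point (x, y) along the camera's right and
    depth axes after the camera orbits the scene origin by theta_deg
    (counterclockwise, seen from above). Assumption: at 0 degrees the camera
    looks along +y with +x to its right, and perspective is ignored."""
    t = math.radians(theta_deg)
    x, y = point
    right = x * math.cos(t) + y * math.sin(t)
    depth = -x * math.sin(t) + y * math.cos(t)
    return right, depth

def relations(a, b, theta_deg):
    """Ground-truth relation of a with respect to b from the orbited viewpoint."""
    ra, da = camera_frame(a, theta_deg)
    rb, db = camera_frame(b, theta_deg)
    horizontal = "left" if ra < rb else "right"
    in_depth = "front" if da < db else "behind"  # smaller depth = nearer camera
    return horizontal, in_depth

def viewpoint_consistency(answer_fn, pairs, angles):
    """Fraction of (pair, angle) queries where answer_fn matches ground truth."""
    total = agree = 0
    for a, b in pairs:
        for theta in angles:
            total += 1
            agree += answer_fn(a, b, theta) == relations(a, b, theta)
    return agree / total

def stale_model(a, b, theta_deg):
    """Hypothetical failure mode: ignores the orbit and answers for 0 degrees."""
    return relations(a, b, 0)

A, B = (-1.0, 0.5), (1.0, -0.5)
print(relations(A, B, 0))    # relation from the original viewpoint
print(relations(A, B, 180))  # both relations flip after a half orbit
print(relations(A, B, 360) == relations(A, B, 0))  # a full orbit restores them
print(viewpoint_consistency(stale_model, [(A, B)], [0, 90, 180, 270]))
```

In this sketch, a model that answers correctly at 0° but never updates its answer scores perfectly on the single-view query while agreeing with ground truth on only one of the four orbit angles, which is the kind of gap between single-view accuracy and counterfactual consistency that the abstract describes.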