While Multi-Modal Large Language Models (MLLMs) demonstrate impressive general reasoning capabilities, their embodied spatial intelligence remains hampered by a "Cartesian Illusion": a reliance on text-based probability distributions that lack a grounded understanding of 3D topology. This limitation is starkly exposed in multi-agent environments, which demand more than scene perception; they require second-order Theory of Mind (ToM). Specifically, Agent A must infer Agent B's belief about the environment, a belief governed strictly by B's physical orientation and sensory limitations. In this paper, we probe the limits of two-stage spatial inference in MLLMs through a novel audio-visual task that requires Agent A to predict Agent B's estimate of A's relative location. To solve this task, we propose an Epistemic Sensory Bottleneck module that abandons rigid, rule-based coordinate transformations in favor of an Anchor-Based Embodied Spatial Decomposition Chain-of-Thought (CoT). The chain guides the MLLM through a "geometric-to-semantic" projection: the model first establishes B's local coordinate system and then dynamically weights the visual and auditory modalities according to whether A falls within B's visual frustum. Extensive evaluations reveal that current MLLMs fundamentally struggle with spatial symmetry and out-of-view ambiguities, establishing a rigorous zero-shot baseline of 42% accuracy, whereas our sensory-bounded reasoning chain robustly outperforms pure egocentric and allocentric baselines. By systematically benchmarking these perceptual bottlenecks, our work exposes the current limits of MLLM spatial reasoning and establishes a foundational paradigm for epistemic, modality-aware inference in Embodied AI.
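To make the geometry behind the task concrete, the following is a minimal sketch of the two quantities the abstract refers to: A's position expressed in B's local coordinate system, and whether A falls within B's visual frustum. It is an illustration of the task's underlying geometry, not the proposed method, which elicits this reasoning from the MLLM through a chain of thought rather than an explicit transform. The 2D simplification, the 90-degree field of view, the four direction labels, the modality weights, and all function names are assumptions introduced here for illustration.

```python
import math

def to_local_frame(a_xy, b_xy, b_heading_rad):
    """Express A's world position in B's frame (forward, left).

    b_heading_rad is B's facing direction, measured counterclockwise
    from the world +x axis (an assumed convention).
    """
    dx = a_xy[0] - b_xy[0]
    dy = a_xy[1] - b_xy[1]
    cos_h, sin_h = math.cos(b_heading_rad), math.sin(b_heading_rad)
    forward = dx * cos_h + dy * sin_h   # component along B's gaze
    left = -dx * sin_h + dy * cos_h     # component toward B's left
    return forward, left

def bearing_deg(forward, left):
    """Signed angle of A relative to B's gaze: 0 is ahead, positive is left."""
    return math.degrees(math.atan2(left, forward))

def in_visual_frustum(bearing, fov_deg=90.0):
    """True if A lies inside B's horizontal field of view (assumed 90 degrees)."""
    return abs(bearing) <= fov_deg / 2.0

def modality_weights(visible):
    """Hypothetical (visual, auditory) weights: vision dominates only if A is seen."""
    return (0.9, 0.1) if visible else (0.0, 1.0)

def coarse_label(bearing):
    """Map a bearing to an assumed four-way direction vocabulary."""
    if abs(bearing) <= 45.0:
        return "front"
    if abs(bearing) >= 135.0:
        return "behind"
    return "left" if bearing > 0 else "right"

def b_estimate_of_a(a_xy, b_xy, b_heading_rad, fov_deg=90.0):
    """Bundle the geometric facts that bound B's belief about where A is."""
    forward, left = to_local_frame(a_xy, b_xy, b_heading_rad)
    bearing = bearing_deg(forward, left)
    visible = in_visual_frustum(bearing, fov_deg)
    return {
        "label": coarse_label(bearing),
        "visible": visible,
        "weights": modality_weights(visible),
    }
```

For example, `b_estimate_of_a((1, 0), (0, 0), math.pi / 2)` describes a B that faces the world +y axis with A located along +x: A is to B's right and outside the assumed field of view, so under these illustrative weights B's estimate would rest entirely on audio.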