Existing Multimodal Large Language Models (MLLMs) struggle with 3D spatial reasoning, as they fail to construct structured abstractions of the 3D environment depicted in video inputs. To bridge this gap, drawing inspiration from cognitive theories of allocentric spatial reasoning, we investigate how to enable MLLMs to model and reason over text-based spatial representations of video. Specifically, we introduce Textual Representation of Allocentric Context from Egocentric Video (TRACE), a prompting method that induces MLLMs to generate text-based representations of 3D environments as intermediate reasoning traces for more accurate spatial question answering. TRACE encodes meta-context, camera trajectories, and detailed object entities to support structured spatial reasoning over egocentric videos. Extensive experiments on VSI-Bench and OST-Bench demonstrate that TRACE yields notable and consistent improvements over prior prompting strategies across a diverse range of MLLM backbones, spanning different parameter scales and training schemes. We further present ablation studies to validate our design choices, along with detailed analyses that probe the bottlenecks of 3D spatial reasoning in MLLMs.
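To make the three components named above (meta-context, camera trajectory, object entities) more concrete, the following is a minimal Python sketch of what a text-based allocentric representation could look like once written down. It is an illustration only, not the TRACE format itself: the class names (`CameraPose`, `ObjectEntity`, `SceneTrace`), the 2D room-fixed coordinates, the field layout, and the serialized text format are all assumptions introduced here. In TRACE the MLLM produces such a representation as an intermediate reasoning trace; the sketch only shows why a shared room-fixed frame makes a question such as "how far is the sofa from the TV?" straightforward to answer.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class CameraPose:
    """Hypothetical camera pose in a room-fixed (allocentric) frame."""
    step: int            # index of the sampled video frame
    x: float             # position in metres
    y: float
    heading_deg: float   # facing direction, counter-clockwise from the +x axis


@dataclass
class ObjectEntity:
    """Hypothetical object entry placed in the same room-fixed frame."""
    name: str
    x: float
    y: float
    first_seen_step: int


@dataclass
class SceneTrace:
    """Assumed container for the three components named in the abstract."""
    meta_context: str
    trajectory: list[CameraPose] = field(default_factory=list)
    entities: list[ObjectEntity] = field(default_factory=list)

    def to_text(self) -> str:
        """Serialize the scene as plain text (format is an assumption)."""
        lines = [f"meta_context: {self.meta_context}", "camera_trajectory:"]
        for p in self.trajectory:
            lines.append(
                f"  - step {p.step}: pos=({p.x:.1f}, {p.y:.1f}), "
                f"heading={p.heading_deg:.0f}deg"
            )
        lines.append("entities:")
        for e in self.entities:
            lines.append(
                f"  - {e.name}: pos=({e.x:.1f}, {e.y:.1f}), "
                f"first_seen=step {e.first_seen_step}"
            )
        return "\n".join(lines)

    def distance_between(self, name_a: str, name_b: str) -> float:
        """Euclidean distance between two named entities, in metres."""
        by_name = {e.name: e for e in self.entities}
        a, b = by_name[name_a], by_name[name_b]
        return math.hypot(a.x - b.x, a.y - b.y)


if __name__ == "__main__":
    scene = SceneTrace(
        meta_context="living room, indoor, single floor",
        trajectory=[CameraPose(0, 1.0, 1.0, 0.0), CameraPose(1, 2.0, 1.0, 90.0)],
        entities=[ObjectEntity("sofa", 0.0, 0.0, 0), ObjectEntity("tv", 3.0, 4.0, 1)],
    )
    print(scene.to_text())
    print(scene.distance_between("sofa", "tv"))
```

Because every entity lives in one shared frame rather than in per-frame camera coordinates, relational questions reduce to simple arithmetic over the text, which is the intuition behind using an allocentric representation as the reasoning trace.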