Spatial intelligence is a key frontier for multimodal large language models (MLLMs), enabling them to reason about the physical world from visual experience. Inspired by human spatial cognition, recent approaches construct grid-based cognitive maps from multi-frame visual inputs to maintain coherent spatial representations over time. However, limited context length still constrains spatial understanding, and existing remedies, such as long-context modeling and external memory, often require architectural changes, dedicated memory modules, or finetuning, which limits their applicability to off-the-shelf pretrained MLLMs. This motivates a lightweight, model-agnostic method for preserving spatial information beyond the native context window. To this end, we propose a plug-and-play multi-agent framework in which agents collaboratively construct cognitive maps that serve as structured spatial memory, enhancing the spatial understanding of arbitrary pretrained MLLMs without architectural modification or additional training. The framework has three components: local-global agent coordination, cognitive map construction with atomic commits, and cross-agent verification. Extensive experiments show that our method improves performance on spatial understanding tasks while remaining fully training-free. Code will be released.