Deploying Multimodal Large Language Models as the brain of embodied agents remains challenging, particularly under long-horizon observations and limited context budgets. Existing memory-assisted methods often rely on textual summaries, which discard rich visual and spatial details and remain brittle in non-stationary environments. In this work, we propose a non-parametric memory framework that explicitly disentangles episodic and semantic memory for embodied exploration and question answering. Our retrieval-first, reasoning-assisted paradigm recalls episodic experiences via semantic similarity and verifies them through visual reasoning, enabling robust reuse of past observations without rigid geometric alignment. In parallel, we introduce a program-style rule extraction mechanism that converts experiences into structured, reusable semantic memory, facilitating cross-environment generalization. Extensive experiments demonstrate state-of-the-art performance on embodied question answering and exploration benchmarks, yielding a 7.3% gain in LLM-Match and an 11.4% gain in LLM-Match×SPL on A-EQA, as well as a +7.7% success rate and +6.8% SPL on GOAT-Bench. Our analyses reveal that episodic memory primarily improves exploration efficiency, while semantic memory strengthens the complex reasoning of embodied agents.
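To make the retrieval-first, reasoning-assisted paradigm concrete, the sketch below shows one possible reading of the two-stage recall: episodic entries are first ranked by embedding similarity to the query, then a visual-reasoning model verifies the surviving candidates. This is a minimal illustration under assumed interfaces; the names `recall_episodic`, `verify_with_vlm`, the `memory` entry schema, and the `vlm` callable are hypothetical placeholders, not the framework's actual API.

```python
import numpy as np

def recall_episodic(query_emb, memory, top_k=5, sim_threshold=0.6):
    """Stage 1 (hypothetical): rank episodic entries by cosine similarity.

    `memory` is assumed to be a list of dicts, each holding an
    'embedding' (np.ndarray) and an 'observation' (e.g., a stored frame).
    """
    sims = [
        float(np.dot(query_emb, m["embedding"])
              / (np.linalg.norm(query_emb) * np.linalg.norm(m["embedding"])))
        for m in memory
    ]
    # Sort by similarity only (key=p[0]) so ties never compare dicts.
    ranked = sorted(zip(sims, memory), key=lambda p: p[0], reverse=True)
    return [m for s, m in ranked[:top_k] if s >= sim_threshold]

def verify_with_vlm(question, candidates, vlm):
    """Stage 2 (hypothetical): keep candidates the visual reasoner confirms.

    `vlm` stands in for the multimodal LLM: any callable
    (question, observation) -> bool judging relevance of the observation.
    """
    return [c for c in candidates if vlm(question, c["observation"])]
```

Because verification operates on the retrieved observations themselves rather than on their coordinates, this recall path needs no geometric alignment between past and current viewpoints, consistent with the robustness claim above.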