Constructing compact and informative 3D scene representations is essential for effective embodied exploration and reasoning, especially in complex environments over extended periods. Existing representations, such as object-centric 3D scene graphs, oversimplify spatial relationships by modeling scenes as isolated objects with restrictive textual relationships, making it difficult to address queries that require nuanced spatial understanding. Moreover, these representations lack natural mechanisms for active exploration and memory management, which hinders their application to lifelong autonomy. In this work, we propose 3D-Mem, a novel 3D scene memory framework for embodied agents. 3D-Mem employs informative multi-view images, termed Memory Snapshots, to represent the scene and capture rich visual information about explored regions. It further integrates frontier-based exploration by introducing Frontier Snapshots (glimpses of unexplored areas), enabling agents to make informed decisions by considering both known and potential new information. To support lifelong memory in active exploration settings, we present an incremental construction pipeline for 3D-Mem, as well as a memory retrieval technique for memory management. Experimental results on three benchmarks demonstrate that 3D-Mem significantly enhances agents' exploration and reasoning capabilities in 3D environments, highlighting its potential for advancing applications in embodied AI.