In this paper, we demonstrate that mobile manipulation policies utilizing a 3D latent map achieve stronger spatial and temporal reasoning than policies relying solely on images. We introduce Seeing the Bigger Picture (SBP), an end-to-end policy learning approach that operates directly on a 3D map of latent features. In SBP, the map extends perception beyond the robot's current field of view and aggregates observations over long horizons. Our mapping approach incrementally fuses multiview observations into a grid of scene-specific latent features. A pre-trained, scene-agnostic decoder reconstructs target embeddings from these features and enables online optimization of the map features during task execution. A policy, trainable with behavior cloning or reinforcement learning, treats the latent map as a state variable and uses global context from the map obtained via a 3D feature aggregator. We evaluate SBP on scene-level mobile manipulation and sequential tabletop manipulation tasks. Our experiments demonstrate that SBP (i) reasons globally over the scene, (ii) leverages the map as long-horizon memory, and (iii) outperforms image-based policies in both in-distribution and novel scenes, e.g., improving the success rate by 15% for the sequential manipulation task.