How To Not Train Your Dragon: Training-free Embodied Object Goal Navigation with Semantic Frontiers

Object goal navigation is an important problem in Embodied AI in which an agent must navigate to an instance of a given object category in an unknown environment, typically an indoor scene. Current state-of-the-art methods for this problem rely heavily on data-driven approaches, e.g., end-to-end reinforcement learning and imitation learning. Such methods are typically costly to train and difficult to debug, which limits their transferability and explainability. Inspired by recent successes in combining classical and learning-based methods, we present a modular, training-free solution to object goal navigation that leans on classical approaches. Our method builds a structured scene representation on top of the classic visual simultaneous localization and mapping (V-SLAM) framework; the representation comprises a 2D occupancy map, a semantic point cloud, and a spatial scene graph. We then inject semantics into geometry-based frontier exploration: semantics are propagated on the scene graph using language priors and scene statistics, which attaches semantic knowledge to the geometric frontiers and lets the agent reason about which frontier is most promising to explore when searching for the goal object. The proposed pipeline shows strong experimental performance for object goal navigation on the Gibson benchmark dataset, outperforming the previous state-of-the-art. We also perform comprehensive ablation studies to identify the current bottleneck in the object navigation task.
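To make the idea of semantically scored frontiers concrete, the following is a minimal sketch and not the pipeline described above. It assumes a 2D occupancy grid with hypothetical cell codes (`FREE`, `OCCUPIED`, `UNKNOWN`), a list of already observed objects with grid positions, and a hypothetical co-occurrence table `prior` standing in for language priors and scene statistics. The exponential distance decay and all function names are illustrative assumptions; the scene-graph propagation itself is not reproduced here.

```python
import numpy as np

# Hypothetical cell codes for the occupancy grid (assumption, not from the paper).
FREE, OCCUPIED, UNKNOWN = 0, 1, -1


def find_frontiers(grid):
    """Return (row, col) indices of free cells with a 4-connected unknown neighbor."""
    free = grid == FREE
    unknown = grid == UNKNOWN
    # Pad with False so border cells do not see out-of-map cells as unknown.
    padded = np.pad(unknown, 1, constant_values=False)
    near_unknown = (
        padded[:-2, 1:-1]    # neighbor above
        | padded[2:, 1:-1]   # neighbor below
        | padded[1:-1, :-2]  # neighbor to the left
        | padded[1:-1, 2:]   # neighbor to the right
    )
    return np.argwhere(free & near_unknown)


def score_frontier(frontier, objects, prior, goal, length_scale=2.0):
    """Sum the goal prior of each observed object, decayed by distance to the frontier."""
    score = 0.0
    for label, position in objects:
        dist = np.linalg.norm(np.asarray(position, float) - np.asarray(frontier, float))
        # prior[(label, goal)]: assumed likelihood of finding `goal` near `label`.
        score += prior.get((label, goal), 0.0) * np.exp(-dist / length_scale)
    return score


def select_frontier(grid, objects, prior, goal):
    """Pick the frontier cell with the highest semantic score, or None if none exist."""
    frontiers = find_frontiers(grid)
    if len(frontiers) == 0:
        return None
    scores = [score_frontier(f, objects, prior, goal) for f in frontiers]
    best = frontiers[int(np.argmax(scores))]
    return tuple(int(v) for v in best)


# Toy map: unknown columns on both sides, one obstacle in the middle.
grid = np.array([
    [-1, 0, 0, 0, -1],
    [-1, 0, 1, 0, -1],
    [-1, 0, 0, 0, -1],
])
objects = [("sofa", (1, 0)), ("toilet", (1, 4))]
prior = {("sofa", "tv"): 0.8, ("toilet", "tv"): 0.05}
best = select_frontier(grid, objects, prior, goal="tv")
```

In this toy setup the frontier beside the sofa wins because the assumed prior ties "tv" far more strongly to "sofa" than to "toilet". A purely geometric explorer would have no reason to prefer one side of the map over the other.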
