RoboHop: Segment-based Topological Map Representation for Open-World Visual Navigation

Mapping is crucial for spatial reasoning, planning and robot navigation. Existing approaches range from metric maps, which require precise geometry-based optimization, to purely topological maps, whose image-as-node graphs lack explicit object-level reasoning and interconnectivity. In this paper, we propose a novel topological representation of an environment based on "image segments", which are semantically meaningful and open-vocabulary queryable, conferring several advantages over previous works based on pixel-level features. Unlike 3D scene graphs, we create a purely topological graph with segments as nodes, where edges are formed by a) associating segment-level descriptors between pairs of consecutive images and b) connecting neighboring segments within an image using their pixel centroids. This yields a "continuous sense of a place", defined by the inter-image persistence of segments together with their intra-image neighbors. It further enables us to represent and update segment-level descriptors through neighborhood aggregation using graph convolution layers, which improves robot localization based on segment-level retrieval. Using real-world data, we show how our proposed map representation can be used to i) generate navigation plans in the form of "hops over segments" and ii) search for target objects using natural language queries describing spatial relations of objects. Furthermore, we quantitatively analyze data association at the segment level, which underpins both inter-image connectivity during mapping and segment-level localization when revisiting the same place. Finally, we show preliminary trials of zero-shot real-world navigation based on segment-level "hopping". Project page with supplementary details: oravus.github.io/RoboHop/
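
The graph construction and "hops over segments" planning described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each image is already given as a list of segments with a pixel centroid and a descriptor vector, links each segment to its k nearest centroids within an image (the abstract only says that neighboring segments are connected via their centroids), associates segments across consecutive images by best cosine similarity above a threshold, and plans with plain breadth-first search. The names `build_segment_graph` and `plan_hops` and the parameters `k` and `min_sim` are hypothetical.

```python
from collections import deque
from math import dist, sqrt


def cosine(a, b):
    """Cosine similarity between two descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def build_segment_graph(images, k=2, min_sim=0.8):
    """Build an undirected graph whose nodes are (image_index, segment_index).

    images: list of images; each image is a list of (centroid, descriptor).
    k and min_sim are illustrative assumptions, not values from the paper.
    """
    graph = {(i, j): set() for i, segs in enumerate(images) for j in range(len(segs))}

    def link(u, v):
        graph[u].add(v)
        graph[v].add(u)

    for i, segs in enumerate(images):
        # Intra-image edges: connect each segment to its k nearest centroids.
        for j, (centroid, _) in enumerate(segs):
            others = [m for m in range(len(segs)) if m != j]
            others.sort(key=lambda m: dist(centroid, segs[m][0]))
            for m in others[:k]:
                link((i, j), (i, m))

        # Inter-image edges: associate descriptors with the next image only.
        if i + 1 < len(images) and images[i + 1]:
            nxt = images[i + 1]
            for j, (_, desc) in enumerate(segs):
                best = max(range(len(nxt)), key=lambda m: cosine(desc, nxt[m][1]))
                if cosine(desc, nxt[best][1]) >= min_sim:
                    link((i, j), (i + 1, best))
    return graph


def plan_hops(graph, start, goal):
    """Return the shortest sequence of segment nodes from start to goal, or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None


# Toy data: two consecutive images, two segments each (made-up values).
images = [
    [((10, 10), [1.0, 0.0]), ((50, 10), [0.0, 1.0])],
    [((12, 10), [0.9, 0.1]), ((60, 12), [0.1, 0.9])],
]
graph = build_segment_graph(images, k=1)
path = plan_hops(graph, start=(0, 0), goal=(1, 1))
print(path)
```

In this sketch a plan is simply a list of segment nodes; each step either moves to a neighboring segment in the same image or follows a segment that persists into the next image. A real system would replace the toy descriptors with learned segment features and could weight edges rather than counting hops.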
