Objective-oriented navigation (ObjNav) enables a robot to navigate directly and autonomously to a target object in an unknown environment. Effective perception is critical for autonomous robots navigating such environments. While egocentric observations from RGB-D sensors provide abundant local information, real-time top-down maps offer valuable global context for ObjNav. Nevertheless, most existing studies rely on a single source and seldom integrate these two complementary perceptual modalities, even though humans naturally attend to both. Building on the rapid advancement of Vision-Language Models (VLMs), we propose Hybrid Perception Navigation (HyPerNav), which leverages the strong reasoning and vision-language understanding capabilities of VLMs to jointly perceive local and global information, improving the effectiveness and intelligence of navigation in unknown environments. In both large-scale simulation evaluation and real-world validation, our method achieves state-of-the-art performance against popular baselines. By simultaneously leveraging egocentric observations and the top-down map, the hybrid perception approach captures richer cues and finds target objects more effectively. Our ablation study further shows that each of the two perception sources contributes to navigation performance.
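To make the hybrid-perception idea concrete, the sketch below illustrates how a single decision step might be posed to a VLM: the egocentric frame and the current top-down map are both passed as images alongside a textual instruction, and the model's reply is mapped to a discrete navigation action. This is a minimal illustration under assumptions, not the paper's implementation; the names `HybridObservation`, `choose_action`, and the `vlm_query` callable are hypothetical placeholders for whatever multimodal client is actually used.

```python
# Illustrative sketch only: `vlm_query` stands in for any hypothetical VLM client
# that accepts a list of images plus a text prompt and returns a text answer.
from dataclasses import dataclass
from typing import Callable, List, Sequence

import numpy as np


@dataclass
class HybridObservation:
    egocentric_rgb: np.ndarray   # local view from the onboard RGB-D camera
    top_down_map: np.ndarray     # global map accumulated during exploration


def choose_action(
    obs: HybridObservation,
    target: str,
    actions: Sequence[str],
    vlm_query: Callable[[List[np.ndarray], str], str],
) -> str:
    """Ask a VLM to pick the next action from both local and global perception."""
    prompt = (
        f"You are navigating to find a '{target}'. "
        "Image 1 is the robot's egocentric view; image 2 is the current top-down map. "
        f"Reply with exactly one action from {list(actions)}."
    )
    answer = vlm_query([obs.egocentric_rgb, obs.top_down_map], prompt).strip()
    # Fall back to a safe default if the model's reply is not a valid action.
    return answer if answer in actions else actions[0]
```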