Current visual navigation strategies mainly follow an exploration-first, then goal-directed navigation paradigm. The exploratory phase inevitably compromises overall navigation efficiency. Recent studies propose leveraging floor plans alongside RGB inputs to guide agents, aiming for rapid navigation without prior exploration or mapping. Despite early successes, key issues persist: the modality gap and content misalignment between floor plans and RGB images call for an efficient approach that extracts the most salient and complementary features from both for reliable navigation. Here, we propose GlocDiff, a novel framework that employs a diffusion-based policy to continuously predict future waypoints. This policy is conditioned on two complementary information streams: (1) local depth cues derived from the current RGB observation, and (2) global directional guidance extracted from the floor plan. The former ensures immediate navigation safety by capturing the surrounding geometry, while the latter ensures goal-directed efficiency by offering definitive directional cues. Extensive evaluations on the FloNa benchmark demonstrate that GlocDiff achieves superior efficiency and effectiveness, and its successful deployment in real-world scenarios underscores its strong potential for broad practical application.
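The conditioning scheme described above can be sketched as a DDPM-style denoising loop over a waypoint trajectory. This is a minimal illustrative sketch, not the paper's model: the horizon, noise schedule, and the toy `predict_noise` stand-in for the learned denoiser are all assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

H = 8          # number of future waypoints (assumed horizon)
T = 50         # diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def predict_noise(x, t, depth_feat, plan_dir):
    """Toy stand-in for the learned denoiser. In the real model this
    would be a network conditioned on both streams; here a depth-derived
    'clearance' scalar (local safety cue) scales a straight-line guess
    along the floor-plan direction (global guidance)."""
    clearance = float(np.tanh(depth_feat.mean()))
    target = np.outer(np.arange(1, H + 1), plan_dir)   # waypoints toward goal
    return (x - clearance * target) / np.sqrt(1.0 - alpha_bar[t])

def denoise_waypoints(depth_feat, plan_dir):
    """Iteratively denoise a (H, 2) trajectory from Gaussian noise."""
    x = rng.standard_normal((H, 2))
    for t in reversed(range(T)):
        eps = predict_noise(x, t, depth_feat, plan_dir)
        # standard DDPM posterior-mean update
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal((H, 2))
    return x

waypoints = denoise_waypoints(depth_feat=np.ones(16), plan_dir=np.array([1.0, 0.0]))
print(waypoints.shape)  # (8, 2)
```

With the plan direction set to +x, the denoised waypoints drift toward that heading, illustrating how global guidance steers the policy while the depth term modulates its magnitude.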