Visual navigation follows the intuition that humans can navigate without detailed maps. A common approach is to explore interactively while building a topological graph whose nodes store images that can be used for planning. Recent variations learn from passive videos and can navigate using complex social and semantic cues. However, these methods require a significant number of training videos, rely on large graphs, and use odometry, so their scenes are not truly unseen. We introduce a new approach to visual navigation using feudal learning, which employs a hierarchical structure consisting of a worker agent, a mid-level manager, and a high-level manager. Central to the feudal learning paradigm, agents at each level see a different aspect of the task and operate at different spatial and temporal scales. We develop two unique modules in this framework. For the high-level manager, we learn a memory proxy map in a self-supervised manner to record prior observations in a learned latent space, avoiding the use of graphs and odometry. For the mid-level manager, we develop a waypoint network that outputs intermediate subgoals, imitating human waypoint selection during local navigation. This waypoint network is pre-trained on a new, small set of teleoperation videos that we make publicly available, with training environments different from testing environments. The resulting feudal navigation network achieves near state-of-the-art (SOTA) performance while providing a novel no-RL, no-graph, no-odometry, no-metric-map approach to the image-goal navigation task.
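To make the three-level hierarchy concrete, the following is a minimal toy sketch of how a high-level manager, a mid-level manager, and a worker might interact at different temporal scales. It is not the authors' implementation: the visit-count dictionary merely stands in for the learned memory proxy map, the rule-based `select_waypoint` stands in for the learned waypoint network, and all names (`HighLevelManager`, `MidLevelManager`, `worker_action`, `run_episode`, `high_every`, `mid_every`) as well as the discretized latent cells are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical discretized coordinate in a latent space (assumption for this sketch).
Latent = Tuple[int, int]


@dataclass
class HighLevelManager:
    """Toy stand-in for a memory proxy map: visit counts over latent cells."""
    visits: Dict[Latent, int] = field(default_factory=dict)

    def record(self, cell: Latent) -> None:
        # Remember that the agent has observed this latent cell once more.
        self.visits[cell] = self.visits.get(cell, 0) + 1

    def propose_direction(self, cell: Latent) -> Latent:
        # Prefer the neighbouring latent cell visited least often so far.
        # Ties are broken by the order of the candidate list.
        candidates = [(1, 0), (-1, 0), (0, 1), (0, -1)]
        return min(
            candidates,
            key=lambda d: self.visits.get((cell[0] + d[0], cell[1] + d[1]), 0),
        )


@dataclass
class MidLevelManager:
    """Rule-based placeholder for a learned waypoint network."""
    image_width: int = 64

    def select_waypoint(self, direction: Latent) -> int:
        # Return an image column: shift left or right of centre based on
        # the horizontal component of the high-level direction.
        centre = self.image_width // 2
        return centre + direction[0] * (self.image_width // 4)


def worker_action(waypoint_col: int, image_width: int, tolerance: int = 4) -> str:
    """Map a waypoint column to a low-level discrete action."""
    centre = image_width // 2
    if waypoint_col < centre - tolerance:
        return "turn_left"
    if waypoint_col > centre + tolerance:
        return "turn_right"
    return "forward"


def run_episode(
    latent_cells: List[Latent], high_every: int = 4, mid_every: int = 2
) -> List[str]:
    """Run the hierarchy over a sequence of (already embedded) observations."""
    high, mid = HighLevelManager(), MidLevelManager()
    direction: Latent = (0, 1)
    waypoint = mid.image_width // 2
    actions: List[str] = []
    for t, cell in enumerate(latent_cells):
        high.record(cell)                       # memory is updated every step
        if t % high_every == 0:                 # coarsest temporal scale
            direction = high.propose_direction(cell)
        if t % mid_every == 0:                  # intermediate temporal scale
            waypoint = mid.select_waypoint(direction)
        actions.append(worker_action(waypoint, mid.image_width))  # every step
    return actions
```

In this sketch each level sees a different view of the task: the high-level manager only sees latent cells, the mid-level manager only sees a direction, and the worker only sees a waypoint column. No graph, odometry, or metric map appears anywhere in the loop, which mirrors the constraints stated in the abstract, although the actual learned modules are of course far richer than these placeholders.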