Embodied navigation holds significant promise for real-world applications such as last-mile delivery. However, most existing approaches are confined to either indoor or outdoor environments and rely heavily on strong assumptions, such as access to precise coordinate systems. While current outdoor methods can guide agents to the vicinity of a target using coarse-grained localization, they fail to enable fine-grained entry through specific building entrances, critically limiting their utility in practical deployment scenarios that require seamless outdoor-to-indoor transitions. To bridge this gap, we introduce a novel task: out-to-in prior-free instruction-driven embodied navigation. This formulation explicitly eliminates reliance on accurate external priors, requiring agents to navigate solely based on egocentric visual observations guided by instructions. To tackle this task, we propose a vision-centric embodied navigation framework that leverages image-based prompts to drive decision-making. Additionally, we present the first open-source dataset for this task, featuring a pipeline that integrates trajectory-conditioned video synthesis into the data generation process. Through extensive experiments, we demonstrate that our proposed method consistently outperforms state-of-the-art baselines across key metrics including success rate and path efficiency.