Embodied navigation has long been fragmented by task-specific architectures. We introduce ABot-N0, a unified Vision-Language-Action (VLA) foundation model that achieves a ``Grand Unification'' across five core tasks: Point-Goal, Object-Goal, Instruction-Following, POI-Goal, and Person-Following. ABot-N0 adopts a hierarchical ``Brain-Action'' architecture, pairing an LLM-based Cognitive Brain for semantic reasoning with a flow-matching-based Action Expert for precise, continuous trajectory generation. To support large-scale learning, we developed the ABot-N0 Data Engine, curating 16.9M expert trajectories and 5.0M reasoning samples across 7,802 high-fidelity 3D scenes (10.7 $\text{km}^2$). ABot-N0 sets new state-of-the-art results on 7 benchmarks, significantly outperforming specialized models. Furthermore, our Agentic Navigation System integrates a planner with hierarchical topological memory, enabling robust, long-horizon missions in dynamic real-world environments.
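To make the flow-matching action generation concrete, the sketch below shows the inference-time idea: trajectories are produced by integrating a learned velocity field from Gaussian noise at $t=0$ to a waypoint sequence at $t=1$. This is a minimal illustration, not ABot-N0's implementation; the `velocity` function is a hypothetical stand-in for the Action Expert's network (which would be conditioned on the Cognitive Brain's output), here hard-wired to a fixed target for demonstration.

```python
import numpy as np

# Hypothetical stand-in target: 8 waypoints in 2-D. In the real system the
# velocity network is conditioned on the Cognitive Brain's latent plan.
TARGET = np.linspace([0.0, 0.0], [1.0, 2.0], num=8)

def velocity(x, t):
    """Conditional flow-matching velocity for the linear (optimal-transport)
    path x_t = (1 - t) * x0 + t * x1: the target velocity is x1 - x0,
    which equals (TARGET - x) / (1 - t) along the path."""
    return (TARGET - x) / max(1.0 - t, 1e-6)

def sample_trajectory(steps=50, seed=0):
    """Integrate dx/dt = v(x, t) from noise at t=0 to t=1 with Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(TARGET.shape)  # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt)
    return x

traj = sample_trajectory()
print(np.abs(traj - TARGET).max())  # near zero: the flow reaches the target
```

Because the velocity field here is the exact conditional one, Euler integration recovers the target waypoints; a trained network only approximates this field, so real samples carry approximation error.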