OmniNav: A Unified Framework for Prospective Exploration and Visual-Language Navigation

Embodied navigation is a core challenge for intelligent robots: they must comprehend visual environments, follow natural language instructions, and explore autonomously. Existing models rarely offer a unified solution across navigation paradigms, which results in low success rates and limited generalization. We introduce OmniNav, a unified framework that addresses instruct-goal, object-goal, and point-goal navigation, as well as frontier-based exploration, within a single architecture. Our approach features a lightweight, low-latency policy that accurately predicts continuous-space waypoints (coordinates and orientations). This policy surpasses action-chunk methods in precision and supports real-world deployment at control frequencies up to 5 Hz. Architecturally, OmniNav employs a fast-slow system design: a fast module generates waypoints from short-horizon visual context and the current subtask, while a slow module performs deliberative planning over long-horizon observations and candidate frontiers to select the next subgoal and subtask. This collaboration improves path efficiency and maintains trajectory coherence, particularly in exploration and memory-intensive scenarios. Crucially, we find that the primary bottleneck is not navigation policy learning alone, but robust understanding of general instructions and objects. To improve generalization, OmniNav integrates large-scale, general-purpose training datasets, including those for image captioning and visual recognition, into a joint multi-task training regimen. This significantly improves success rates and robustness. Extensive experiments confirm OmniNav's state-of-the-art performance across various navigation benchmarks, and real-world deployment further validates its efficacy. OmniNav provides practical insights for embodied navigation and charts a scalable path towards versatile, highly generalizable robotic intelligence.

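
The fast-slow division of labour described in the abstract can be illustrated with a toy control loop. This is only a minimal sketch under stated assumptions, not OmniNav's implementation: the names `Waypoint`, `Subgoal`, `slow_plan`, `fast_step`, and `run_episode` are hypothetical, the slow module is replaced by a nearest-frontier rule, and the fast module by a straight-line step in a 2D world frame. In a real system both would be learned models, and each loop tick would correspond to one control period (0.2 s at the 5 Hz reported above).

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Waypoint:
    x: float    # position in a 2D world frame (assumed), metres
    y: float
    yaw: float  # heading in radians


@dataclass
class Subgoal:
    target: Point  # frontier chosen by the slow module
    subtask: str   # short text handed to the fast module


def slow_plan(pose: Point, frontiers: List[Point]) -> Optional[Subgoal]:
    """Placeholder for the slow module: pick the nearest candidate frontier."""
    if not frontiers:
        return None
    target = min(frontiers, key=lambda f: math.dist(pose, f))
    return Subgoal(target=target, subtask=f"go to frontier {target}")


def fast_step(pose: Point, subgoal: Subgoal, max_step: float = 0.25) -> Waypoint:
    """Placeholder for the fast module: one continuous waypoint toward the subgoal."""
    dx, dy = subgoal.target[0] - pose[0], subgoal.target[1] - pose[1]
    dist = math.hypot(dx, dy)
    yaw = math.atan2(dy, dx)
    # Move at most max_step metres per tick, never overshooting the target.
    scale = min(1.0, max_step / dist) if dist > 0 else 0.0
    return Waypoint(pose[0] + dx * scale, pose[1] + dy * scale, yaw)


def run_episode(start: Point, frontiers: List[Point], max_ticks: int = 200,
                reach_tol: float = 1e-6) -> Tuple[List[Waypoint], int]:
    """Run the fast module every tick; call the slow module only when a subgoal is needed."""
    pose, remaining = start, list(frontiers)
    subgoal: Optional[Subgoal] = None
    path: List[Waypoint] = []
    slow_calls = 0
    for _ in range(max_ticks):
        if subgoal is None:
            subgoal = slow_plan(pose, remaining)
            slow_calls += 1
            if subgoal is None:
                break  # no frontiers left to explore
        wp = fast_step(pose, subgoal)
        path.append(wp)
        pose = (wp.x, wp.y)
        if math.dist(pose, subgoal.target) <= reach_tol:
            remaining.remove(subgoal.target)
            subgoal = None  # triggers the slow module on the next tick
    return path, slow_calls


if __name__ == "__main__":
    path, slow_calls = run_episode((0.0, 0.0), [(1.0, 0.0), (1.0, 1.0)])
    print(f"{len(path)} waypoints, {slow_calls} slow-module calls")
```

The point of the sketch is the call pattern: the fast module emits a waypoint on every tick, while the slow module runs only when a new subgoal is required, so its higher latency does not limit the control rate.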