As embodied AI moves toward real-world deployment, the criterion for success in Vision-and-Language Navigation (VLN) is shifting from mere reachability to social compliance. However, current agents fall into a "goal-driven trap": they prioritize physical geometry ("can I go?") over semantic rules ("may I go?") and frequently overlook subtle regulatory constraints. To bridge this gap, we introduce Rule-VLN, the first large-scale urban benchmark for rule-compliant navigation. Built on a 29k-node environment, it injects 177 diverse regulatory categories into 8k constrained nodes across four curriculum levels, challenging agents with fine-grained visual and behavioral constraints. We further propose the Semantic Navigation Rectification Module (SNRM), a universal, zero-shot module that equips pre-trained agents with safety awareness. SNRM combines a coarse-to-fine, VLM-based visual perception framework with an epistemic mental map for dynamic detour planning. Experiments show that Rule-VLN is challenging for state-of-the-art models and that SNRM significantly restores their navigation capability, reducing CVR by 19.26% and raising TC by 5.97%.